La práctica a desarrollar consiste en la elaboración y presentación de un informe de un proyecto de Ciencia de Datos, utilizando las técnicas aprendidas durante la asignatura, aplicadas al conjunto de datos seleccionados.
El grupo eligió trabajar en lenguage R (RStudio version 1.4.1717) y utilizar como herramienta de control de versiones GitHub. El proyecto “/practica_final_ml1” fue creado por LuisaYánez (usuario lyanezgu)y compartido con los restantes participantes del grupo Miguel García (usuario mgarciasanc2021).
Link del proyecto en GitHub: https://github.com/lyanezgu/practica_final_ml1
library(formatR)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(tidyr)
library(missForest)
library(VIM)
library(formattable)
library(usmap)
library(cowplot)
library(corrplot)
library(MASS)
library(ggfortify)
library(nortest)
library(car)
library(lmtest)
library(PerformanceAnalytics)
library(Amelia)
library(ggthemes)
library(tidyverse)
library(tibble)
library(gridExtra)
library(ggbiplot)
library(factoextra)
library(caret)
library(ISLR)
library(rpart)
library(rpart.plot)
library(rattle)
library(tsne)
library(Rtsne)
library(class)
library(ada)
library(factoextra)
library(cluster)
library(useful)
library(mgcv)
library(xgboost)
library(randomForest)
library(kernlab)
library(pROC)
library(doMC)
library(ggpubr)
library(ROCR)
El conjunto de datos elegido por el grupo se llama “Red Wine Quality” e incluye información sobre la variantes de vino tinto dentro del “Vinho Verde” portugués analizándolo y describiéndolo a través de sus características fisicoquímicas y sensoriales.
Link del data set: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009.
El conjunto de datos “Red Wine Quality” contiene 12 columnas y 1599 filas y lo obtenemos en formato .CSV.
Inicialmente se han guardado los datos en un data frame llamado “red_wine” y se ha realizado un estudio inicial sobre su contenido utilizando la función head y summary.
red_wine <- read_csv("winequality-red.csv")
head(red_wine)
## # A tibble: 6 × 12
## `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # … with 7 more variables: free sulfur dioxide <dbl>,
## # total sulfur dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
summary(red_wine)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free sulfur dioxide total sulfur dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Empezando ya el análisis inicial del conjunto de datos que tenemos, vemos que las 12 variables que componen los datos pueden ser descritas como:
Input variables o Varibles de entrada/predictoras (basado en pruebas fisicoquímicas):
Output variable o Variable de salida/respuesta/objetivo (basado en datos sensoriales):
El objetivo final del proyecto es conseguir llegar a un modelo que permita predecir la calidad del vino tinto de la variedad portuguesa de “Vinho Verde” y saber si estamos ante vinos recomendables (aprobados/bebibles) o no recomendables y que se deberían evitar (suspensos/no bebibles).
Se ha decidido realizar un cambio en el nombre de las variables que aparecen en las columnas de los datos para así seguir un mismo patrón y a la vez evitar tener espacios que nos pueden llegar a dar problemas a futuro.
names(red_wine) <- c("fixed_acidity", "volatile_acidity", "citric_acid",
"residual_sugar", "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide",
"density", "pH", "sulphates", "alcohol", "quality")
head(red_wine)
## # A tibble: 6 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
Todas las variables input de las que disponemos en el dataset son de tipo numérico y entendemos que en principio no requieren ninguna transformación en ese sentido.
Cabría la posibilidad de tratar de transformar la variable “quality” (output) en factor para hacerla categórica en función de la calidad del vino (clasificación del vino en números enteros entre el 0 y el 10). Se podría pasar a categorizar el vino como “malo”, “normal” y “bueno”, como “apobado” o “suspenso”, o del 0 al 10 en las diferentes categorías numéricas que vienen predefinidas.
str(red_wine)
## spec_tbl_df [1,599 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile_acidity : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric_acid : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual_sugar : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free_sulfur_dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
## $ total_sulfur_dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
## - attr(*, "spec")=
## .. cols(
## .. `fixed acidity` = col_double(),
## .. `volatile acidity` = col_double(),
## .. `citric acid` = col_double(),
## .. `residual sugar` = col_double(),
## .. chlorides = col_double(),
## .. `free sulfur dioxide` = col_double(),
## .. `total sulfur dioxide` = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
A través de la función summary empezamos comprobando que no hay datos faltantes en el data set. Por ello el grupo ha tenido que añadirlos manualmente para tratar de aproximarnos a un caso más real donde lo normal es encontrarlos y tener que lidiar con ellos.
Los datos faltantes han sido imputados exclusivamente en las columnas que no creemos que no van a servir de análisis principal para este estudio (pH y sulphates), para así intentar que la predicción que hagamos sea lo más precisa posible.
Utilizamos la librería missForest y generamos una semilla para que el resultado sea siempre el mismo.
red_wine
## # A tibble: 1,599 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## 7 7.9 0.6 0.06 1.6 0.069
## 8 7.3 0.65 0 1.2 0.065
## 9 7.8 0.58 0.02 2 0.073
## 10 7.5 0.5 0.36 6.1 0.071
## # … with 1,589 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
set.seed(101)
red_wine <- bind_cols(red_wine[c(1, 2, 3, 4, 5, 6, 7, 8, 11,
12)], missForest::prodNA(red_wine[c(-1, -2, -3, -4, -5, -6,
-7, -8, -11, -12)], noNA = 0.1))
red_wine
## # A tibble: 1,599 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## 7 7.9 0.6 0.06 1.6 0.069
## 8 7.3 0.65 0 1.2 0.065
## 9 7.8 0.58 0.02 2 0.073
## 10 7.5 0.5 0.36 6.1 0.071
## # … with 1,589 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## # pH <dbl>, sulphates <dbl>
Haciendo uso de la librería VIM y de la librería Amelia, analizamos la estructura que tienen nuestros datos faltantes dentro de nuestro data set para ver y entender como se distribuyen y a que variables afecta.
Se puede comprobar que la proporción de datos faltantes en estas variables es de aproximadamente 10% y hay 15 filas en que las dos variables son faltantes.
summary(aggr(red_wine, numbers = T, sortVar = T))
##
## Variables sorted by number of missings:
## Variable Count
## pH 0.10318949
## sulphates 0.09631019
## fixed_acidity 0.00000000
## volatile_acidity 0.00000000
## citric_acid 0.00000000
## residual_sugar 0.00000000
## chlorides 0.00000000
## free_sulfur_dioxide 0.00000000
## total_sulfur_dioxide 0.00000000
## density 0.00000000
## alcohol 0.00000000
## quality 0.00000000
##
## Missings per variable:
## Variable Count
## fixed_acidity 0
## volatile_acidity 0
## citric_acid 0
## residual_sugar 0
## chlorides 0
## free_sulfur_dioxide 0
## total_sulfur_dioxide 0
## density 0
## alcohol 0
## quality 0
## pH 165
## sulphates 154
##
## Missings in combinations of variables:
## Combinations Count Percent
## 0:0:0:0:0:0:0:0:0:0:0:0 1295 80.9881176
## 0:0:0:0:0:0:0:0:0:0:0:1 139 8.6929331
## 0:0:0:0:0:0:0:0:0:0:1:0 150 9.3808630
## 0:0:0:0:0:0:0:0:0:0:1:1 15 0.9380863
missmap(red_wine, main = "Missing Map")
Una vez vistos por encima la estructura general de los datos y habiendo añadido los datos faltantes que nos hacian falta, pasamos a dividir el conjunto de datos en dos para diferenciar los que usaremos de entrenamiento de los que usaremos de test (viendo la cantidad de datos de la que disponemos, la distribución elegida ha sido: 20% test y 80% training). Establecemos una semilla que nos guarde de forma permanente la división que hacemos para que la distribución de los datos sea siempre la misma.
Guardamos además la partición de datos de test para ser utilizada a futuro para la validación del modelo final y pasamos a trabajar de aquí en adelante con la partición de training.
set.seed(101)
sample <- sample.int(n = nrow(red_wine), size = floor(0.8 * nrow(red_wine)),
replace = F)
train <- red_wine[sample, ]
test <- red_wine[-sample, ]
train
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 2.8 0.068
## 2 7.6 0.49 0.33 1.9 0.074
## 3 5 1.02 0.04 1.4 0.045
## 4 7.6 0.43 0.29 2.1 0.075
## 5 6.8 0.59 0.1 1.7 0.063
## 6 6.8 0.815 0 1.2 0.267
## 7 8.5 0.21 0.52 1.9 0.09
## 8 7.4 0.36 0.29 2.6 0.087
## 9 5.5 0.49 0.03 1.8 0.044
## 10 6.8 0.49 0.22 2.3 0.071
## # … with 1,269 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## # pH <dbl>, sulphates <dbl>
test
## # A tibble: 320 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.3 0.65 0 1.2 0.065
## 3 8.9 0.22 0.48 1.8 0.077
## 4 7.6 0.41 0.24 1.8 0.08
## 5 7.1 0.71 0 1.9 0.08
## 6 5.7 1.13 0.09 1.5 0.172
## 7 7.3 0.45 0.36 5.9 0.074
## 8 8.1 0.66 0.22 2.2 0.069
## 9 6.8 0.67 0.02 1.8 0.05
## 10 5.6 0.31 0.37 1.4 0.074
## # … with 310 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## # pH <dbl>, sulphates <dbl>
Para la imputación de datos faltantes en las columnas “pH” y “sulphates”, se ha decidido reemplazar todos sus NAs según los valores medianos de las mismas variables.
Con la función summary se comprueba que ya no hay más datos faltantes en el data set train.
train$pH[is.na(train$pH)] <- median(train$pH, na.rm = TRUE)
train$sulphates[is.na(train$sulphates)] <- median(train$sulphates,
na.rm = TRUE)
summary(train)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. : 0.900
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.1000 1st Qu.: 1.900
## Median : 7.900 Median :0.5200 Median :0.2600 Median : 2.200
## Mean : 8.357 Mean :0.5262 Mean :0.2732 Mean : 2.552
## 3rd Qu.: 9.300 3rd Qu.:0.6300 3rd Qu.:0.4300 3rd Qu.: 2.600
## Max. :15.900 Max. :1.5800 Max. :0.7900 Max. :15.500
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.0120 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.0710 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.0800 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.0882 Mean :15.86 Mean : 46.44 Mean :0.9968
## 3rd Qu.:0.0910 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9979
## Max. :0.6110 Max. :68.00 Max. :289.00 Max. :1.0037
## alcohol quality pH sulphates
## Min. : 8.40 Min. :3.000 Min. :2.860 Min. :0.3300
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:3.220 1st Qu.:0.5600
## Median :10.20 Median :6.000 Median :3.300 Median :0.6200
## Mean :10.43 Mean :5.635 Mean :3.308 Mean :0.6526
## 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.:3.380 3rd Qu.:0.7100
## Max. :14.90 Max. :8.000 Max. :4.010 Max. :1.9500
Analizamos como se distribuyen las diferentes variables de nuestro dataset.
train %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value, fill = key)) + facet_wrap(~key, scales = "free") +
geom_histogram(bins = sqrt(nrow(train))) + theme(legend.position = "none")
A partir de las gráficas podemos ver que algunas de las variables están distribuidas de forma normal, y parte de las variables están sesgadas a la derecha.
La distribución de “fixed_acidity” y “volatile_acidity” es muy similar, lo que indica que hay ciertas similitudes entre los dos indicadores fisicoquímicos.
Las variables “density” y el “pH” se distribuyen normalmente, lo que indica que todos los vinos tintos tienen poca diferencia en estos dos indicadores. No se requiere por tanto transformación alguna de su distribución.
Las variables “residual_sugar”, “chlorides”, “free_sulfur_dioxide”, “total_sulfur_dioxide”, and “sulphates” están muy sesgadas, por lo qye sería conveniente transformarlas para que la distribución de sus valores fuese más homogénea. Este resultado se consigue aplicando una transformación logarítmica y normalizando de esa manera sus distribuciones:
train <- train %>%
mutate(Log_residual_sugar = log(residual_sugar), Log_chlorides = log(chlorides),
Log_free_sulfur_dioxide = log(free_sulfur_dioxide), Log_total_sulfur_dioxide = log(total_sulfur_dioxide),
Log_sulphates = log(sulphates))
ga <- train %>%
ggplot(aes(x = Log_residual_sugar)) + geom_histogram(bins = 20,
fill = "#619CFF")
gb <- train %>%
ggplot(aes(x = Log_chlorides)) + geom_histogram(bins = 20,
fill = "#E58700")
gc <- train %>%
ggplot(aes(x = Log_free_sulfur_dioxide)) + geom_histogram(bins = 20,
fill = "#00BF7D")
gd <- train %>%
ggplot(aes(x = Log_total_sulfur_dioxide)) + geom_histogram(bins = 20,
fill = "#FD61D1")
ge <- train %>%
ggplot(aes(x = Log_sulphates)) + geom_histogram(bins = 20,
fill = "#B983FF")
grid.arrange(ga, gb, gc, gd, ge)
Modificamos nuestro dataset original para que las variables transformadas a logaritmos sustituyan a las mismas pero aún sin transformar. Tendremos de ese modo un dataset con 12 variables también, pero 5 de ellas transformadas a logaritmos.
train <- train %>%
dplyr::select(-residual_sugar, -chlorides, -free_sulfur_dioxide,
-total_sulfur_dioxide, -sulphates)
train %>%
summary
## fixed_acidity volatile_acidity citric_acid density
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. :0.9901
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.1000 1st Qu.:0.9956
## Median : 7.900 Median :0.5200 Median :0.2600 Median :0.9968
## Mean : 8.357 Mean :0.5262 Mean :0.2732 Mean :0.9968
## 3rd Qu.: 9.300 3rd Qu.:0.6300 3rd Qu.:0.4300 3rd Qu.:0.9979
## Max. :15.900 Max. :1.5800 Max. :0.7900 Max. :1.0037
## alcohol quality pH Log_residual_sugar
## Min. : 8.40 Min. :3.000 Min. :2.860 Min. :-0.1054
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:3.220 1st Qu.: 0.6419
## Median :10.20 Median :6.000 Median :3.300 Median : 0.7885
## Mean :10.43 Mean :5.635 Mean :3.308 Mean : 0.8554
## 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.:3.380 3rd Qu.: 0.9555
## Max. :14.90 Max. :8.000 Max. :4.010 Max. : 2.7408
## Log_chlorides Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## Min. :-4.4228 Min. :0.000 Min. :1.792
## 1st Qu.:-2.6451 1st Qu.:1.946 1st Qu.:3.091
## Median :-2.5257 Median :2.639 Median :3.638
## Mean :-2.4980 Mean :2.545 Mean :3.604
## 3rd Qu.:-2.3969 3rd Qu.:3.045 3rd Qu.:4.127
## Max. :-0.4927 Max. :4.220 Max. :5.666
## Log_sulphates
## Min. :-1.1087
## 1st Qu.:-0.5798
## Median :-0.4780
## Mean :-0.4496
## 3rd Qu.:-0.3425
## Max. : 0.6678
Una vez realizadas las transformaciones logaritmicas oportunas sobre las 5 variables que lo requerían, volvemos a ver de forma general las distribuciones del conjunto total de variables:
train %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value, fill = key)) + facet_wrap(~key, scales = "free") +
geom_histogram(bins = sqrt(nrow(train))) + theme(legend.position = "none")
Analizamos en detalle como se distribuye la variable de salida “quality” referente a las puntuaciones de calidad de entre 0 y 10 de los vinos.
ggplot(data = train) + geom_bar(mapping = aes(x = quality, fill = as.factor(quality))) +
labs(title = "Histograma Calidad Vino")
table(train$quality)
##
## 3 4 5 6 7 8
## 7 38 552 513 156 13
prop.table(table(train$quality))
##
## 3 4 5 6 7 8
## 0.005473026 0.029710711 0.431587177 0.401094605 0.121970289 0.010164191
Con la gráfica y los datos podemos ver que la mayor parte de los vinos (sobre un 83% de ellos) están clasificados con valores de calidad de 5 y 6, sobre calificaciones que van de 0 a 10.
Analizamos si nuestras variables tienen valores atípicos, cuales son sus valores medios y vemos sus intervalos de confianza, a través de gráficos de tipo Boxplot.
Boxplot variable alcohol
BoxPlot_alcohol <- ggplot(train, aes(x = factor(quality), y = alcohol)) +
geom_boxplot() + geom_boxplot(fill = "#F8766D") + ggtitle("Boxplot alcohol")
BoxPlot_alcohol
Apreciamos que los vinos con mejor puntuación en “quality” tienen en general mayor % de alcohol.
Boxplot variable citric_acid
BoxPlot_citric_acid <- ggplot(train, aes(x = factor(quality),
y = citric_acid)) + geom_boxplot() + geom_boxplot(fill = "#E58700") +
ggtitle("Boxplot citric_acid")
BoxPlot_citric_acid
Apreciamos que los vinos con mejor puntuación en “quality” tienen en general mayor cantidad de ácido cítrico.
Boxplot variable density
BoxPlot_density <- ggplot(train, aes(x = factor(quality), y = density)) +
geom_boxplot() + geom_boxplot(fill = "#C99800") + ggtitle("Boxplot density")
BoxPlot_density
Apreciamos que los vinos con mejor puntuación en “quality” tienen en general una leve menor densidad, pero no es una variable determinante en la calidad del producto.
Boxplot variable fixed_acidity
BoxPlot_fixed_acidity <- ggplot(train, aes(x = factor(quality),
y = fixed_acidity)) + geom_boxplot() + geom_boxplot(fill = "#6BB100") +
ggtitle("Boxplot fixed_acidity")
BoxPlot_fixed_acidity
Apreciamos que la variable “fixed_acidity” se mantiene bastante estable independientemente de la calidad final del vino, sin tener grandes diferencias entre los diferentes rangos de calidad.
Boxplot variable Log_chlorides
BoxPlot_Log_chlorides <- ggplot(train, aes(x = factor(quality),
y = Log_chlorides)) + geom_boxplot() + geom_boxplot(fill = "#00BA38") +
ggtitle("Boxplot Log_chlorides")
BoxPlot_Log_chlorides
Apreciamos que la variable “Log_chlorides” se mantiene bastante estable independientemente de la calidad final del vino, sin tener grandes diferencias entre los diferentes rangos de calidad.
Boxplot variable Log_free_sulfur_dioxide
BoxPlot_Log_free_sulfur_dioxide <- ggplot(train, aes(x = factor(quality),
y = Log_free_sulfur_dioxide)) + geom_boxplot() + geom_boxplot(fill = "#00BF7D") +
ggtitle("Boxplot Log_free_sulfur_dioxide")
BoxPlot_Log_free_sulfur_dioxide
No se aprecia una tendencia específica en la variable “Log_free_sulfur_dioxide” que sea decisiva en la calidad del vino.
Boxplot variable Log_residual_sugar
BoxPlot_Log_residual_sugar <- ggplot(train, aes(x = factor(quality),
y = Log_residual_sugar)) + geom_boxplot() + geom_boxplot(fill = "#00C0AF") +
ggtitle("Boxplot Log_residual_sugar")
BoxPlot_Log_residual_sugar
Apreciamos que la variable “Log_residual_sugar” se mantiene bastante estable independientemente de la calidad final del vino, sin tener grandes diferencias entre los diferentes rangos de calidad.
Boxplot variable Log_sulphates
BoxPlot_Log_sulphates <- ggplot(train, aes(x = factor(quality),
y = Log_sulphates)) + geom_boxplot() + geom_boxplot(fill = "#00BCD8") +
ggtitle("Boxplot Log_sulphates")
BoxPlot_Log_sulphates
Apreciamos que los vinos con mejor puntuación en “quality” tienen en general mayor cantidad de la variable “Log_sulphates”, aunque existen bastantes outlier e puntuaciones de 5 y 6, que podrían llevar a error.
Boxplot variable Log_total_sulfur_dioxide
BoxPlot_Log_total_sulfur_dioxide <- ggplot(train, aes(x = factor(quality),
y = Log_total_sulfur_dioxide)) + geom_boxplot() + geom_boxplot(fill = "#00B0F6") +
ggtitle("Boxplot Log_total_sulfur_dioxide")
BoxPlot_Log_total_sulfur_dioxide
No se aprecia una tendencia específica en la variable “Log_total_sulfur_dioxide” que sea decisiva en la calidad del vino.
Boxplot variable pH
BoxPlot_pH <- ggplot(train, aes(x = factor(quality), y = pH)) +
geom_boxplot() + geom_boxplot(fill = "#B983FF") + ggtitle("Boxplot pH")
BoxPlot_pH
Apreciamos que los vinos con mejor puntuación en “quality” tienen en general un leve menor valor de pH,aunque existen numeros outliers en vinos puntuados con 5 y 6 que podrían llevar a error.
Boxplot variable volatile_acidity
BoxPlot_volatile_acidity <- ggplot(train, aes(x = factor(quality),
y = volatile_acidity)) + geom_boxplot() + geom_boxplot(fill = "#FF67A4") +
ggtitle("Boxplot volatile_acidity")
BoxPlot_volatile_acidity
Apreciamos que los vinos con mejor puntuación en “quality” tienen en general menor cantidad de “ácido cítrico”volatile_acidity".
Continuando con en análisis de las distintas variables del data set y el estudio de como se relacionan entre si, queremos ver de forma global como se correlacionan las variables numéricas que nos pueden llegar a servir para el modelo de predicción.
pairs(train)
corrplot(cor(train %>%
mutate(quality = as.numeric(quality)) %>%
keep(is.numeric)))
res <- cor(train %>%
mutate(quality = as.numeric(quality)) %>%
keep(is.numeric))
round(res, 2)
## fixed_acidity volatile_acidity citric_acid density
## fixed_acidity 1.00 -0.26 0.68 0.68
## volatile_acidity -0.26 1.00 -0.55 0.02
## citric_acid 0.68 -0.55 1.00 0.37
## density 0.68 0.02 0.37 1.00
## alcohol -0.05 -0.21 0.15 -0.49
## quality 0.14 -0.39 0.25 -0.16
## pH -0.64 0.22 -0.49 -0.32
## Log_residual_sugar 0.20 0.04 0.19 0.44
## Log_chlorides 0.16 0.09 0.16 0.33
## Log_free_sulfur_dioxide -0.18 0.03 -0.11 -0.04
## Log_total_sulfur_dioxide -0.12 0.08 -0.03 0.11
## Log_sulphates 0.19 -0.30 0.32 0.14
## alcohol quality pH Log_residual_sugar Log_chlorides
## fixed_acidity -0.05 0.14 -0.64 0.20 0.16
## volatile_acidity -0.21 -0.39 0.22 0.04 0.09
## citric_acid 0.15 0.25 -0.49 0.19 0.16
## density -0.49 -0.16 -0.32 0.44 0.33
## alcohol 1.00 0.49 0.18 0.06 -0.29
## quality 0.49 1.00 -0.07 0.03 -0.16
## pH 0.18 -0.07 1.00 -0.10 -0.26
## Log_residual_sugar 0.06 0.03 -0.10 1.00 0.12
## Log_chlorides -0.29 -0.16 -0.26 0.12 1.00
## Log_free_sulfur_dioxide -0.09 -0.03 0.08 0.10 -0.02
## Log_total_sulfur_dioxide -0.24 -0.16 -0.03 0.17 0.06
## Log_sulphates 0.13 0.33 -0.14 0.02 0.22
## Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## fixed_acidity -0.18 -0.12
## volatile_acidity 0.03 0.08
## citric_acid -0.11 -0.03
## density -0.04 0.11
## alcohol -0.09 -0.24
## quality -0.03 -0.16
## pH 0.08 -0.03
## Log_residual_sugar 0.10 0.17
## Log_chlorides -0.02 0.06
## Log_free_sulfur_dioxide 1.00 0.79
## Log_total_sulfur_dioxide 0.79 1.00
## Log_sulphates 0.06 0.04
## Log_sulphates
## fixed_acidity 0.19
## volatile_acidity -0.30
## citric_acid 0.32
## density 0.14
## alcohol 0.13
## quality 0.33
## pH -0.14
## Log_residual_sugar 0.02
## Log_chlorides 0.22
## Log_free_sulfur_dioxide 0.06
## Log_total_sulfur_dioxide 0.04
## Log_sulphates 1.00
Vemos que las variables que más estan correlacionadas con la variable “quality” son: “volatile_acidity”, “citric_acid”, “alcohol” y “Log_sulphates”.
Realizamos un análisis bivariante para ver que variables están más correlacionadas, positva o negativamente, entre si.
Correlación: fixed_acidity y citric_acid:
cor(x = train$fixed_acidity, y = train$citric_acid)
## [1] 0.678372
train %>%
ggplot(aes(fixed_acidity, citric_acid)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables fixed_acidity y citric_acid",
x = "fixed_acidity", y = "citric_acid")
Correlación: fixed_acidity y density:
cor(x = train$fixed_acidity, y = train$density)
## [1] 0.6782196
train %>%
ggplot(aes(fixed_acidity, density)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables fixed_acidity y density",
x = "fixed_acidity", y = "density")
Correlación: fixed_acidity y pH:
cor(x = train$fixed_acidity, y = train$pH)
## [1] -0.644656
train %>%
ggplot(aes(fixed_acidity, pH)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables fixed_acidity y pH",
x = "fixed_acidity", y = "pH")
Correlación: citric_acid y volatile_acidity:
cor(x = train$citric_acid, y = train$volatile_acidity)
## [1] -0.5538307
train %>%
ggplot(aes(citric_acid, volatile_acidity)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables citric_acid y volatile_acidity",
x = "citric_acid", y = "volatile_acidity")
Correlación: citric_acid y pH:
cor(x = train$citric_acid, y = train$pH)
## [1] -0.4941459
train %>%
ggplot(aes(citric_acid, pH)) + geom_point(alpha = 0.2, colour = "green") +
geom_smooth(formula = "y ~ x", method = "lm") + labs(title = "Relación entre variables citric_acid y pH",
x = "citric_acid", y = "pH")
Correlación: density y Log_residual_sugar:
cor(x = train$density, y = train$Log_residual_sugar)
## [1] 0.4399375
train %>%
ggplot(aes(density, Log_residual_sugar)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables density y Log_residual_sugar",
x = "density", y = "Log_residual_sugar")
Correlación: density y alcohol:
cor(x = train$density, y = train$alcohol)
## [1] -0.4880924
train %>%
ggplot(aes(density, alcohol)) + geom_point(alpha = 0.2, colour = "green") +
geom_smooth(formula = "y ~ x", method = "lm") + labs(title = "Relación entre variables density y alcohol",
x = "density", y = "alcohol")
Correlación: quality y alcohol:
cor(x = train$quality, y = train$alcohol)
## [1] 0.4895963
train %>%
ggplot(aes(train$quality, train$alcohol)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables quality y alcohol",
x = "quality", y = "alcohol")
Correlación: quality y volatile_acidity:
cor(x = train$quality, y = train$volatile_acidity)
## [1] -0.3904367
train %>%
ggplot(aes(quality, volatile_acidity)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables quality y volatile_acidity",
x = "quality", y = "volatile_acidity")
Correlación: Log_free_sulfur_dioxide y Log_total_sulfur_dioxide:
cor(x = train$Log_free_sulfur_dioxide, y = train$Log_total_sulfur_dioxide)
## [1] 0.7856495
train %>%
ggplot(aes(Log_free_sulfur_dioxide, Log_total_sulfur_dioxide)) +
geom_point(alpha = 0.2, colour = "green") + geom_smooth(formula = "y ~ x",
method = "lm") + labs(title = "Relación entre variables Log_free_sulfur_dioxide y Log_total_sulfur_dioxide",
x = "Log_free_sulfur_dioxide", y = "Log_total_sulfur_dioxide")
Una vez analizado en profundidad nuestro conjunto de datos y habiendo entendido y tranformado nuetras variables, trataremos de ajustar un modelo de regresión lineal múltiple que trate de predicir la calidad del vino tinto de la variedad portuguesa de “Vinho Verde”.
Ajustamos un modelo de regresión lineal mútiple con el que vamos a predecir el valor de la variable quality a partir de las siguientes variables independientes(cogemos todas las variables menos “Log_residual_sugar” que no presenta ninguna correlación con la variable “quality”) seleccionadas en base a los análisis y estudios de correlación vistos con anterioridad.
modelo = lm(quality ~ alcohol + fixed_acidity + volatile_acidity +
citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train)
summary(modelo)
##
## Call:
## lm(formula = quality ~ alcohol + fixed_acidity + volatile_acidity +
## citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
## Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27885 -0.34712 -0.05254 0.43254 1.89518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.94814 18.67185 -0.533 0.594274
## alcohol 0.30374 0.02413 12.587 < 2e-16 ***
## fixed_acidity 0.01981 0.02396 0.827 0.408630
## volatile_acidity -1.04091 0.12864 -8.092 1.37e-15 ***
## citric_acid -0.31703 0.15340 -2.067 0.038963 *
## Log_chlorides -0.22535 0.06205 -3.632 0.000293 ***
## Log_total_sulfur_dioxide -0.15705 0.04408 -3.563 0.000380 ***
## Log_free_sulfur_dioxide 0.12800 0.04305 2.973 0.003001 **
## density 14.65001 19.00095 0.771 0.440842
## pH -0.50069 0.18785 -2.665 0.007788 **
## Log_sulphates 0.85751 0.09545 8.984 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6241 on 1268 degrees of freedom
## Multiple R-squared: 0.3862, Adjusted R-squared: 0.3813
## F-statistic: 79.77 on 10 and 1268 DF, p-value: < 2.2e-16
Para la selección de variables se utiliza el método de la selección automática por pasos.
empty.model <- lm(quality ~ 1, data = train)
horizonte <- formula(quality ~ alcohol + fixed_acidity + volatile_acidity +
citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
Log_free_sulfur_dioxide + density + pH + Log_sulphates)
# metodo de selección por pasos e indica las variables que
# son significativas
seleccion = stepAIC(empty.model, direction = c("both"), trace = FALSE,
scope = horizonte)
seleccion$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## quality ~ 1
##
## Final Model:
## quality ~ alcohol + volatile_acidity + Log_sulphates + Log_chlorides +
## pH + Log_total_sulfur_dioxide + Log_free_sulfur_dioxide
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1278 804.4848 -590.9851
## 2 + alcohol 1 192.838647 1277 611.6461 -939.4926
## 3 + volatile_acidity 1 69.859784 1276 541.7863 -1092.6125
## 4 + Log_sulphates 1 29.717001 1275 512.0693 -1162.7631
## 5 + Log_chlorides 1 4.327006 1274 507.7423 -1171.6166
## 6 + pH 1 5.098410 1273 502.6439 -1182.5244
## 7 + Log_total_sulfur_dioxide 1 2.458061 1272 500.1858 -1186.7944
## 8 + Log_free_sulfur_dioxide 1 3.993776 1271 496.1921 -1195.0476
Vemos la información del modelo elegido como “mejor”
summary(seleccion)
##
## Call:
## lm(formula = quality ~ alcohol + volatile_acidity + Log_sulphates +
## Log_chlorides + pH + Log_total_sulfur_dioxide + Log_free_sulfur_dioxide,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.24223 -0.35766 -0.05925 0.43097 1.88984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.97311 0.44122 11.271 < 2e-16 ***
## alcohol 0.28289 0.01841 15.366 < 2e-16 ***
## volatile_acidity -0.90819 0.10940 -8.301 2.60e-16 ***
## Log_sulphates 0.86434 0.09387 9.208 < 2e-16 ***
## Log_chlorides -0.23817 0.05970 -3.989 7.00e-05 ***
## pH -0.52666 0.13211 -3.987 7.08e-05 ***
## Log_total_sulfur_dioxide -0.17181 0.04226 -4.065 5.10e-05 ***
## Log_free_sulfur_dioxide 0.13511 0.04224 3.198 0.00142 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6248 on 1271 degrees of freedom
## Multiple R-squared: 0.3832, Adjusted R-squared: 0.3798
## F-statistic: 112.8 on 7 and 1271 DF, p-value: < 2.2e-16
Nos quedamos con el modelo seleccionado como el mejor para la regresión según el método utilizado anteriormente.
mejor_modelo = lm(quality ~ alcohol + volatile_acidity + Log_sulphates +
Log_chlorides + pH + Log_total_sulfur_dioxide + citric_acid +
fixed_acidity, data = train)
summary(mejor_modelo)
##
## Call:
## lm(formula = quality ~ alcohol + volatile_acidity + Log_sulphates +
## Log_chlorides + pH + Log_total_sulfur_dioxide + citric_acid +
## fixed_acidity, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.31169 -0.35282 -0.05415 0.42911 1.86472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.27164 0.63721 6.704 3.05e-11 ***
## alcohol 0.30000 0.01884 15.921 < 2e-16 ***
## volatile_acidity -1.07446 0.12691 -8.466 < 2e-16 ***
## Log_sulphates 0.88622 0.09458 9.370 < 2e-16 ***
## Log_chlorides -0.21728 0.06080 -3.574 0.000365 ***
## pH -0.40708 0.16560 -2.458 0.014097 *
## Log_total_sulfur_dioxide -0.05076 0.02651 -1.915 0.055768 .
## citric_acid -0.38575 0.15186 -2.540 0.011197 *
## fixed_acidity 0.03469 0.01610 2.155 0.031357 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6258 on 1270 degrees of freedom
## Multiple R-squared: 0.3818, Adjusted R-squared: 0.3779
## F-statistic: 98.03 on 8 and 1270 DF, p-value: < 2.2e-16
Determinamos los intervalos de confianza para las observaciones de nuestros datos.
intervalos = predict(mejor_modelo, interval = "confidence", level = 0.95)
head(intervalos)
## fit lwr upr
## 1 5.545952 5.455099 5.636806
## 2 5.095579 5.013600 5.177558
## 3 4.981789 4.813456 5.150122
## 4 5.381935 5.312809 5.451061
## 5 5.445889 5.371787 5.519992
## 6 4.747729 4.589377 4.906082
La tabla anova nos muestra la significación de la regresión
anova = aov(mejor_modelo)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## alcohol 1 192.8 192.84 492.415 < 2e-16 ***
## volatile_acidity 1 69.9 69.86 178.387 < 2e-16 ***
## Log_sulphates 1 29.7 29.72 75.883 < 2e-16 ***
## Log_chlorides 1 4.3 4.33 11.049 0.000913 ***
## pH 1 5.1 5.10 13.019 0.000320 ***
## Log_total_sulfur_dioxide 1 2.5 2.46 6.277 0.012358 *
## citric_acid 1 1.0 1.01 2.584 0.108223
## fixed_acidity 1 1.8 1.82 4.644 0.031357 *
## Residuals 1270 497.4 0.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vif(mejor_modelo)
## alcohol volatile_acidity Log_sulphates
## 1.326586 1.667617 1.236976
## Log_chlorides pH Log_total_sulfur_dioxide
## 1.301107 1.887347 1.125057
## citric_acid fixed_acidity
## 2.892183 2.660426
mean(vif(mejor_modelo))
## [1] 1.762162
Generalmente, un VIF por encima de 4 o una tolerancia por debajo de 0,25 indica que podría existir multicolinealidad (fuerte correlación entre variables explicativas del modelo) y se requiere más investigación. Cuando el VIF es superior a 10 o la tolerancia es inferior a 0,1, existe una multicolinealidad significativa que debe corregirse. En este caso no se observa multicolinealidad.
mean(mejor_modelo$residuals)
## [1] -1.419695e-17
# forma grafico 1
plot(mejor_modelo, 1)
# forma grafico 2 que te muestra lo mismo
autoplot(mejor_modelo, 1)
En el gráfico de Residuos vs. Ajustes se observa que la media de los residuos es cercana a cero (aunque no de forma constante), luego la linealidad del modelo no se viola en principio. Pero, al tener una variable dependiente como “quality” que es discreta, un modelo de regresión linela normal no se ajusta a nuestros datos.
Primero se comprueba la normalidad de los residuos, pero al usar Shapiro test solo permite usar las 5000 primeras muestras de los residuos, así que también usamos Anderson-Darling para comparar resultados
Shapiro-Wilk:
# muestras_residuos=resid(mejor_modelo) obtengo la
# ditribucion de los residuos estandarizados
muestras_residuos1 = studres(mejor_modelo)
residual_norm = shapiro.test(muestras_residuos1[0:5000])
residual_norm
##
## Shapiro-Wilk normality test
##
## data: muestras_residuos1[0:5000]
## W = 0.98969, p-value = 7.913e-08
Anderson-Darling:
# install.packages('nortest')
residual_anderson = ad.test(muestras_residuos1)
residual_anderson
##
## Anderson-Darling normality test
##
## data: muestras_residuos1
## A = 4.0651, p-value = 4.078e-10
Este supuesto de normalidad de los residuos también se puede comprobar graficamente y como se ve en la gráfica nuestros datos se separan en las colas de la línea principal y eso nos puede indicar que los residuos no siguen una distribución normal.
# Estas tres graficas te muestran lo mismo
plot(mejor_modelo, 2)
autoplot(mejor_modelo, 2)
hist(muestras_residuos1, freq = FALSE, main = "Distribución de los residuos estandarizados")
xfit <- seq(min(muestras_residuos1), max(muestras_residuos1),
length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)
Con el Q-Q plot vemos que los residuos siguen una distribución normal o al menos se aproximan. Por tanto, se puede asumir que los estimadores de los coeficientes tengan una distribución normal.
Vamos a comprobar la homocedasticidad (que los residuos tengan una varianza constante)
Como podemos ver en los resultados p_value < 0.05 por tanto se rechaza la hipotesis nula y esto indica que la varianza no es constante para este modelo de regresion lineal(hay heterocedasticidad, y esto es un problema). Podemos concluir que este modelo matemático no es adecuado.
# https://fhernanb.github.io/libro_regresion/homo.html otra
# prueba para comprobar homocedasticidad
ncvTest(mejor_modelo)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 17.5856, Df = 1, p = 2.7466e-05
También podemos comprobar gráficamente la hocedasticidad, sería bueno que la línea roja sea lo más recta/horizontal posible.
plot(mejor_modelo, 3)
autoplot(mejor_modelo, 3)
Como se puede ver en los resultados el p_value > 0.05 por lo que aceptamos la Ho de que hay independencia.
dwtest(mejor_modelo)
##
## Durbin-Watson test
##
## data: mejor_modelo
## DW = 2.0088, p-value = 0.563
## alternative hypothesis: true autocorrelation is greater than 0
Se puede comprobar la independencia de los residuos gráficamente y como se observa no se ven patrónes extraños y esto nos puede indicar que hay independencia en los residuos y que estos no presentan autocorrelación.
plot(mejor_modelo$resid)
acf(mejor_modelo$residuals)
Para el análisis de componentes principales cogemos todas las variables de nuestro dataset, menos “quality” que es la que queremos tratar de predecir.
prcomp_train <- prcomp(train[, -6])
prcomp_train
## Standard deviations (1, .., p=11):
## [1] 1.7932919537 1.1139049116 0.8715249896 0.3582166080 0.3099346034
## [6] 0.3016065686 0.2104320866 0.1621592924 0.1042111183 0.0987024931
## [11] 0.0007134006
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixed_acidity 0.987707565 -0.0103012861 -1.022231e-01 0.062459844
## volatile_acidity -0.025573618 -0.0335218329 2.264482e-02 -0.079018774
## citric_acid 0.074203260 0.0250265448 -3.742426e-02 -0.046522512
## density 0.000724006 -0.0007910339 8.763104e-05 -0.001630274
## alcohol -0.033238009 0.9069893906 -3.959749e-01 -0.033337688
## pH -0.052371326 0.0213916196 4.913437e-03 0.011180103
## Log_residual_sugar 0.039855973 -0.0021044752 -1.037845e-01 -0.826895669
## Log_chlorides 0.031702071 -0.0848605722 3.174456e-02 -0.391082136
## Log_free_sulfur_dioxide -0.086908780 -0.2361937132 -6.575105e-01 0.323215453
## Log_total_sulfur_dioxide -0.061636505 -0.3343377688 -6.206551e-01 -0.211995173
## Log_sulphates 0.021906011 0.0167601084 -3.879522e-02 -0.018426525
## PC5 PC6 PC7 PC8
## fixed_acidity -0.0230198704 -0.0059651163 -0.064939449 -0.039228108
## volatile_acidity -0.0390618557 -0.0303160677 -0.597211835 -0.541189128
## citric_acid 0.0729849365 0.0909937418 0.381491021 0.409952831
## density -0.0002677616 -0.0007711058 0.000185434 -0.001576883
## alcohol 0.0565568851 0.0881221105 -0.082562820 -0.001514142
## pH -0.0569108452 -0.0518242796 -0.038186548 -0.140386718
## Log_residual_sugar -0.2808810688 -0.4555153363 0.125710162 -0.007655261
## Log_chlorides 0.8611890499 0.1118830637 -0.226520421 0.140520863
## Log_free_sulfur_dioxide 0.2215346137 -0.5890496504 -0.021993993 0.034613521
## Log_total_sulfur_dioxide -0.1995578676 0.6396298085 -0.031382301 -0.022315853
## Log_sulphates 0.2770165639 0.0647750830 0.645544341 -0.704485728
## PC9 PC10 PC11
## fixed_acidity 0.041080555 -0.0443511620 0.0008574666
## volatile_acidity -0.081317870 0.5769403933 0.0004106125
## citric_acid 0.118259173 0.8056760674 -0.0002945593
## density 0.004180642 -0.0002855141 -0.9999877190
## alcohol -0.015362480 -0.0183878205 -0.0008768774
## pH 0.983885416 -0.0388597280 0.0043212789
## Log_residual_sugar -0.025609909 -0.0311525353 0.0017331842
## Log_chlorides 0.071354850 -0.0867580300 0.0004730618
## Log_free_sulfur_dioxide -0.012248313 0.0445217874 -0.0001883117
## Log_total_sulfur_dioxide 0.025005306 -0.0566273254 0.0002213595
## Log_sulphates -0.053924917 0.0228971577 0.0009037505
Las desviaciones típicas son los autovalores de la matriz de correlaciones, y representan la variabilidad en cada componente. A mayor valor, más relevante es la variable correspondiente a efectos de visualización. Si queremos visualizar la importancia relativa de cada componente, haremos lo siguiente:
plot(prcomp_train)
De modo númérico:
summary(prcomp_train)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7933 1.1139 0.8715 0.35822 0.30993 0.30161 0.21043
## Proportion of Variance 0.5719 0.2207 0.1351 0.02282 0.01708 0.01618 0.00788
## Cumulative Proportion 0.5719 0.7926 0.9277 0.95052 0.96761 0.98378 0.99166
## PC8 PC9 PC10 PC11
## Standard deviation 0.16216 0.10421 0.09870 0.0007134
## Proportion of Variance 0.00468 0.00193 0.00173 0.0000000
## Cumulative Proportion 0.99634 0.99827 1.00000 1.0000000
Para solucionar el problema de que una variable tenga más relevancia y sea más influyente por el hecho de tener más magnitud, se debe realizar una estandarización:
prcomp_train <- prcomp(train[, -6], centre = TRUE, scale = TRUE)
prcomp_train
## Standard deviations (1, .., p=11):
## [1] 1.7629487 1.4397808 1.2644531 1.0430903 0.9814598 0.8259786 0.7584904
## [8] 0.6527267 0.5072574 0.4102052 0.2443802
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixed_acidity 0.497949147 -0.0774604042 0.07773493 -0.12761147
## volatile_acidity -0.228613756 0.2973557116 0.42725833 -0.12854199
## citric_acid 0.453908847 -0.1799547686 -0.23564528 -0.04506366
## density 0.414489403 0.2687629961 0.25839356 -0.17865109
## alcohol -0.094708706 -0.4167029020 -0.37845423 -0.33087315
## pH -0.409475084 -0.0009806387 -0.03806645 -0.15945585
## Log_residual_sugar 0.198616078 0.2082892990 -0.02938800 -0.70534037
## Log_chlorides 0.227309032 0.2157308428 0.20635906 0.40324174
## Log_free_sulfur_dioxide -0.074197539 0.4790706213 -0.48908782 0.01812011
## Log_total_sulfur_dioxide -0.004662316 0.5505085334 -0.39476662 0.01314934
## Log_sulphates 0.220597718 -0.0694794605 -0.32548381 0.37112255
## PC5 PC6 PC7 PC8
## fixed_acidity 0.20602872 -0.01298276 0.32851510 -0.2897369
## volatile_acidity -0.17298151 -0.29204526 0.60308842 -0.1819769
## citric_acid 0.08260043 -0.05040931 -0.15768865 -0.3605598
## density -0.04472090 0.41364082 0.07930408 -0.1990981
## alcohol -0.26306784 -0.40477572 0.21105461 -0.2724748
## pH -0.33063659 0.52406092 -0.18375767 -0.5565766
## Log_residual_sugar -0.37735390 -0.05607819 -0.22622011 0.4088799
## Log_chlorides -0.50419008 -0.43132228 -0.39929144 -0.2507839
## Log_free_sulfur_dioxide 0.11152021 -0.07612622 0.07572221 -0.1353631
## Log_total_sulfur_dioxide 0.14453935 -0.07108447 0.02253235 -0.0743601
## Log_sulphates -0.55709025 0.31997827 0.44949490 0.2744915
## PC9 PC10 PC11
## fixed_acidity 0.31604555 0.126404238 -0.611040342
## volatile_acidity -0.32007451 -0.213814616 0.005828197
## citric_acid -0.61379756 -0.394018864 0.088273875
## density 0.18509322 0.158370103 0.615559165
## alcohol 0.19703270 0.271013852 0.317142337
## pH -0.02868781 -0.001296862 -0.277713094
## Log_residual_sugar -0.04278776 -0.103130711 -0.205992070
## Log_chlorides 0.15996348 0.070066978 -0.059433915
## Log_free_sulfur_dioxide 0.43001182 -0.541972628 0.067211285
## Log_total_sulfur_dioxide -0.35634449 0.611908258 -0.086661149
## Log_sulphates -0.08777829 -0.028491471 -0.064760413
De modo numérico también:
summary(prcomp_train)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7629 1.4398 1.2645 1.04309 0.98146 0.82598 0.7585
## Proportion of Variance 0.2825 0.1885 0.1454 0.09891 0.08757 0.06202 0.0523
## Cumulative Proportion 0.2825 0.4710 0.6163 0.71526 0.80283 0.86485 0.9172
## PC8 PC9 PC10 PC11
## Standard deviation 0.65273 0.50726 0.4102 0.24438
## Proportion of Variance 0.03873 0.02339 0.0153 0.00543
## Cumulative Proportion 0.95588 0.97927 0.9946 1.00000
Analizamos la varianzas y las componentes de un modo más gráfico:
prcomp_train.var <- prcomp_train$sdev^2
prcomp_train.pvar <- prcomp_train.var/sum(prcomp_train.var)
plot(cumsum(prcomp_train.pvar), xlab = "components", ylab = "cumulative variance",
ylim = c(0, 1), type = "b")
grid()
abline(h = 0.95, col = "blue")
plot(prcomp_train, type = "l", main = "Variance explained by PCA")
fviz_screeplot(prcomp_train, addlabels = TRUE)
Como vemos, con las dos primeras componentes (PC1 y PC2) recogemos solo el 47.10% de la variabilidad. Con las tres primeras (PC1, PC2 y PC3) incrementamos la cifra hasta el 61.63%. Esto quiere decir que un gráfico de los datos del vino representados por las dos o tres primeras componentes principales no será suficientemente representativo. Vemos además en el gráfico de componentes y varianza acumulada, como son necesarias las 8 primeras PC para cubrir el 95% de la varianza del dataset. Es dificil encontrale sentido a reducir tan solo la dimensión de 11 variables a 8 PC, con la perdida de explicabilidad que eso implica sobre las variables originales.
Dibujamos los datos proyectados sobre las dos primeras componentes:
ggplot(as.data.frame(prcomp(train[, -6], scale = T)$x[, 1:2]),
aes(x = PC1, y = PC2, label = rownames(train))) + geom_point() +
geom_text(hjust = 0, vjust = 0)
Tratamos de incorporar la información de las variables utilizando la técnica del “biplot”:
ggbiplot(prcomp(train[, -6], labels = rownames(train), scale = T))
ggbiplot(prcomp(train[, -6], scale = T), ellipse = TRUE, labels = rownames(train),
groups = train$quality)
train_fquality <- train %>%
mutate(quality = as.factor(quality))
ggbiplot(prcomp_train, obs.scale = 1, var.scale = 1, groups = train_fquality$quality,
ellipse = TRUE, circle = TRUE) + scale_color_discrete(name = "") +
theme(legend.direction = "horizontal", legend.position = "top")
Vemos que el análisis con solo 2 componentes no es óptimo ya que por ellas mismas no explican un alto porcentaje de la varianza. Aún así, a nivel de análisis explicativo de los datos y de los posibles diferentes grupos, se intuye algún patrón ya que en principio cuanto más abajo del gráfico, mejor calificación tienen los vinos en general (puntos de colores azul y rosa son notas más cercanas a 7 y 8) y más arriba, peor calificación (puntos de colores verde, amarillo y rojo son notas de 5 para abajo).
Realizamos una ampliación del análisis realizado utilizando las 4 primeras componentes principales para tratar de identificar posible agrupaciones más claras de los datos.
colores <- function(vec) {
# la función rainbow() devuelve un vector que contiene
# el número de colores distintos
col <- rainbow(length(unique(vec)))
return(col[as.numeric(as.factor(vec))])
}
par(mfrow = c(1, 2))
# Observaciones sobre PC1 y PC2
plot(prcomp_train$x[, 1:2], col = colores(train_fquality$quality),
pch = 19, xlab = "PC1", ylab = "PC2")
# Observaciones sobre PC1 y PC3
plot(prcomp_train$x[, c(1, 3)], col = colores(train_fquality$quality),
pch = 19, xlab = "PC1", ylab = "PC3")
# Observaciones sobre PC1 y PC4
plot(prcomp_train$x[, c(1, 4)], col = colores(train_fquality$quality),
pch = 19, xlab = "PC1", ylab = "PC4")
La utilización de más componentes (ampliando el análisis hasta la tercera y la cuarta PC) vemos que aporta muy poco y no vemos agrupaciones claras o destacables entre los diferentes grupos. Esto se debe que incluso utilizando las 4 dimensiones de las 4 primeras PC, apenas lograriamos explicar un 71.52% de la varianza de los datos.
En nuestro dataset de train, creamos la variable binaria “nota_vino”, para que en función de “quality” nos diga los vinos con calificaciones aprobadas (quality >= 6, anotados con un “1”) o suspensas (quality < 6, anotados con un “0”).
train_pca <- train[, colnames(train) != "quality"]
train_pca$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(train_pca)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
table(train_pca$nota_vino)
##
## aprobado suspenso
## 682 597
train_pca
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
Pasamos a realizar el análisis de las Componentes Principales tal y como se ha hecho con anterioridad:
prcomp_train_2 <- prcomp(train_pca[, -12], centre = TRUE, scale = TRUE)
prcomp_train_2
## Standard deviations (1, .., p=11):
## [1] 1.7629487 1.4397808 1.2644531 1.0430903 0.9814598 0.8259786 0.7584904
## [8] 0.6527267 0.5072574 0.4102052 0.2443802
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixed_acidity 0.497949147 -0.0774604042 0.07773493 -0.12761147
## volatile_acidity -0.228613756 0.2973557116 0.42725833 -0.12854199
## citric_acid 0.453908847 -0.1799547686 -0.23564528 -0.04506366
## density 0.414489403 0.2687629961 0.25839356 -0.17865109
## alcohol -0.094708706 -0.4167029020 -0.37845423 -0.33087315
## pH -0.409475084 -0.0009806387 -0.03806645 -0.15945585
## Log_residual_sugar 0.198616078 0.2082892990 -0.02938800 -0.70534037
## Log_chlorides 0.227309032 0.2157308428 0.20635906 0.40324174
## Log_free_sulfur_dioxide -0.074197539 0.4790706213 -0.48908782 0.01812011
## Log_total_sulfur_dioxide -0.004662316 0.5505085334 -0.39476662 0.01314934
## Log_sulphates 0.220597718 -0.0694794605 -0.32548381 0.37112255
## PC5 PC6 PC7 PC8
## fixed_acidity 0.20602872 -0.01298276 0.32851510 -0.2897369
## volatile_acidity -0.17298151 -0.29204526 0.60308842 -0.1819769
## citric_acid 0.08260043 -0.05040931 -0.15768865 -0.3605598
## density -0.04472090 0.41364082 0.07930408 -0.1990981
## alcohol -0.26306784 -0.40477572 0.21105461 -0.2724748
## pH -0.33063659 0.52406092 -0.18375767 -0.5565766
## Log_residual_sugar -0.37735390 -0.05607819 -0.22622011 0.4088799
## Log_chlorides -0.50419008 -0.43132228 -0.39929144 -0.2507839
## Log_free_sulfur_dioxide 0.11152021 -0.07612622 0.07572221 -0.1353631
## Log_total_sulfur_dioxide 0.14453935 -0.07108447 0.02253235 -0.0743601
## Log_sulphates -0.55709025 0.31997827 0.44949490 0.2744915
## PC9 PC10 PC11
## fixed_acidity 0.31604555 0.126404238 -0.611040342
## volatile_acidity -0.32007451 -0.213814616 0.005828197
## citric_acid -0.61379756 -0.394018864 0.088273875
## density 0.18509322 0.158370103 0.615559165
## alcohol 0.19703270 0.271013852 0.317142337
## pH -0.02868781 -0.001296862 -0.277713094
## Log_residual_sugar -0.04278776 -0.103130711 -0.205992070
## Log_chlorides 0.15996348 0.070066978 -0.059433915
## Log_free_sulfur_dioxide 0.43001182 -0.541972628 0.067211285
## Log_total_sulfur_dioxide -0.35634449 0.611908258 -0.086661149
## Log_sulphates -0.08777829 -0.028491471 -0.064760413
summary(prcomp_train_2)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7629 1.4398 1.2645 1.04309 0.98146 0.82598 0.7585
## Proportion of Variance 0.2825 0.1885 0.1454 0.09891 0.08757 0.06202 0.0523
## Cumulative Proportion 0.2825 0.4710 0.6163 0.71526 0.80283 0.86485 0.9172
## PC8 PC9 PC10 PC11
## Standard deviation 0.65273 0.50726 0.4102 0.24438
## Proportion of Variance 0.03873 0.02339 0.0153 0.00543
## Cumulative Proportion 0.95588 0.97927 0.9946 1.00000
ggbiplot(prcomp_train_2, obs.scale = 1, var.scale = 1, groups = train_pca$nota_vino,
ellipse = TRUE, circle = TRUE) + scale_color_discrete(name = "") +
theme(legend.direction = "horizontal", legend.position = "top")
Vemos que los resultados obtenidos son los mismos, no obteniendo ninguna mejora. Con esta forma de mostrar los datos realizamos una visualización más clara de lo que nos referiamos.Los puntos más abajo del gráfico se corresponden en general a vinos “aprobados” (puntos de color rosado - vinos con nota igual o superior a 6) y los de más arriba se referencian en general a vinos “suspensos” (puntos de color azulado - vino con notas inferiores a 6). Fuera de eso, y con tan solo un 47.10% de la varianza explicada por las 2 primeras PC, no se aprecian más patrones o conclusiones en los datos.
Intentamos realizar una reducción de la dimensión pero esta vez con métodos, al contrario de PCA, que no sean lineales. Con el algoritmo de t-SNE podemos separar datos que no sean separables de una forma lineal con exclusivamente una línea recta, es decir, nos puede llegar a permitir trabajar con datos lineales no separables. Nos puede servir para llegar a entender datos que tienen una alta dimensión projectándolo a una dimensión menor de solo 2 o 3 espacios o dimensiones.
tsne_train <- (train[, -6])
tsne_train
## # A tibble: 1,279 × 11
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 5 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>
El algoritmo crea una probabilidad de distribución que representa las similaridades entre los vecinos para así tratar de agruparlos, reduciendo la dimensión.
set.seed(3)
tsne_data <- tsne_train[, 1:11]
tsne <- Rtsne(tsne_data, check_duplicates = FALSE, perplexity = 30,
pca = FALSE, theta = 0.5, dims = 2, max_iter = 500, eta = 200,
epoch = 1000)
par(mfrow = c(1, 2))
plot(tsne$Y, col = "black", bg = train_fquality$quality, pch = 21,
cex = 1.5, main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2")
Como vemos los resultados, como en PCA, no son satisfactorios, siendo no deseable la apliclación de estas técnicas en nuestro dataset.
En nuestro dataset de train, creamos la variable binaria “nota_vino”, para que en función de “quality” nos diga los vinos con calificaciones aprobadas (quality >= 6, anotados con un “1”) o suspensas (quality < 6, anotados con un “0”).
train_tsne <- train[, colnames(train) != "quality"]
train_tsne$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(train_tsne)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
table(train_tsne$nota_vino)
##
## aprobado suspenso
## 682 597
train_tsne
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
set.seed(3)
tsne_data_2 <- train_tsne[, 1:11]
tsne_2 <- Rtsne(tsne_data_2, check_duplicates = FALSE, perplexity = 30,
pca = FALSE, theta = 0.5, dims = 2, max_iter = 500, eta = 200,
epoch = 1000)
par(mfrow = c(1, 2))
plot(tsne$Y, col = "black", bg = train_tsne$nota_vino, pch = 21,
cex = 1.5, main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2")
Binarizando la variable respuesta tampoco sacamos demasiado en claro, no siendo posible aplicar una reducción de la dimensión sobre nuestros datos.
Lo primero de todo, creamos la variable binaria “nota_vino”, para que en función de “quality” nos diga los vinos con calificaciones aprobadas (quality >= 6, anotados con un “1”) o suspensas (quality < 6, anotados con un “0”)
train_glm <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_glm
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_glm$nota_vino)
##
## 0 1
## 597 682
str(train_glm)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
Realizando esta distinción entre vinos “Aprobados” y “Suspensos”, vemos que la distibución entre ambos grupos está bastante balanceada, con 597 suspensos y 682 aprobados en los datos de train.
Tras ello, pasamos a ver las correlaciones y el comportamiento de las variables con esta nueva variable categórica creada:
c <- cor(train_glm)
corrplot(c)
Mostramos las correlaciones de forma numérica:
round(c, 2)
## fixed_acidity volatile_acidity citric_acid density
## fixed_acidity 1.00 -0.26 0.68 0.68
## volatile_acidity -0.26 1.00 -0.55 0.02
## citric_acid 0.68 -0.55 1.00 0.37
## density 0.68 0.02 0.37 1.00
## alcohol -0.05 -0.21 0.15 -0.49
## pH -0.64 0.22 -0.49 -0.32
## Log_residual_sugar 0.20 0.04 0.19 0.44
## Log_chlorides 0.16 0.09 0.16 0.33
## Log_free_sulfur_dioxide -0.18 0.03 -0.11 -0.04
## Log_total_sulfur_dioxide -0.12 0.08 -0.03 0.11
## Log_sulphates 0.19 -0.30 0.32 0.14
## nota_vino 0.11 -0.32 0.18 -0.15
## alcohol pH Log_residual_sugar Log_chlorides
## fixed_acidity -0.05 -0.64 0.20 0.16
## volatile_acidity -0.21 0.22 0.04 0.09
## citric_acid 0.15 -0.49 0.19 0.16
## density -0.49 -0.32 0.44 0.33
## alcohol 1.00 0.18 0.06 -0.29
## pH 0.18 1.00 -0.10 -0.26
## Log_residual_sugar 0.06 -0.10 1.00 0.12
## Log_chlorides -0.29 -0.26 0.12 1.00
## Log_free_sulfur_dioxide -0.09 0.08 0.10 -0.02
## Log_total_sulfur_dioxide -0.24 -0.03 0.17 0.06
## Log_sulphates 0.13 -0.14 0.02 0.22
## nota_vino 0.45 -0.01 0.00 -0.14
## Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## fixed_acidity -0.18 -0.12
## volatile_acidity 0.03 0.08
## citric_acid -0.11 -0.03
## density -0.04 0.11
## alcohol -0.09 -0.24
## pH 0.08 -0.03
## Log_residual_sugar 0.10 0.17
## Log_chlorides -0.02 0.06
## Log_free_sulfur_dioxide 1.00 0.79
## Log_total_sulfur_dioxide 0.79 1.00
## Log_sulphates 0.06 0.04
## nota_vino -0.06 -0.20
## Log_sulphates nota_vino
## fixed_acidity 0.19 0.11
## volatile_acidity -0.30 -0.32
## citric_acid 0.32 0.18
## density 0.14 -0.15
## alcohol 0.13 0.45
## pH -0.14 -0.01
## Log_residual_sugar 0.02 0.00
## Log_chlorides 0.22 -0.14
## Log_free_sulfur_dioxide 0.06 -0.06
## Log_total_sulfur_dioxide 0.04 -0.20
## Log_sulphates 1.00 0.28
## nota_vino 0.28 1.00
Analizamos de forma bivariante las variables:
# nota_vino vs alcohol
train_glm %>%
ggplot(aes(x = alcohol, fill = factor(nota_vino))) + geom_density(alpha = 0.5)
# nota_vino vs Log_sulphates
train_glm %>%
ggplot(aes(x = Log_sulphates, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs volatile_acidity
train_glm %>%
ggplot(aes(x = volatile_acidity, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs density
train_glm %>%
ggplot(aes(x = density, fill = factor(nota_vino))) + geom_density(alpha = 0.5)
# nota_vino vs citric_acid
train_glm %>%
ggplot(aes(x = citric_acid, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs Log_total_sulfur_dioxide
train_glm %>%
ggplot(aes(x = Log_total_sulfur_dioxide, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
En términos generales vemos como los vinos analizados que estan en la categoria de aprobados, tienen un mayor valor de “alcohol”, levenmente mayor valor de “Log_sulphates”, menor valor de “volatile_acidity”, levemente menor “density”, mayor “citric_acid” y menor valor de “Log_total_sulfur_dioxide”.
# nota_vino vs fixed_acidity
train_glm %>%
ggplot(aes(x = fixed_acidity, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs Log_free_sulfur_dioxide
train_glm %>%
ggplot(aes(x = Log_free_sulfur_dioxide, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs Log_residual_sugar
train_glm %>%
ggplot(aes(x = Log_residual_sugar, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs pH
train_glm %>%
ggplot(aes(x = pH, fill = factor(nota_vino))) + geom_density(alpha = 0.5)
# nota_vino vs Log_chlorides
train_glm %>%
ggplot(aes(x = Log_chlorides, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
En los casos de las variables “Log_chlorides”, “pH”, “Log_residual_sugar”, “Log_free_sulfur_dioxide” y “fixed_acidity”, cuesta más distinguir en el gráfico de densidad entre vinos aprobados o suspensos, ya que no son características definitivas de un grupo u otro.
Generamos un modelo de regresión logística en base a las variables de nuestro dataset que sirva como predictor de la variable binaria creada.
modelo_glm = glm(nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train_glm,
family = binomial)
modelo_glm
##
## Call: glm(formula = nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
## citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
## Log_free_sulfur_dioxide + density + pH + Log_sulphates, family = binomial,
## data = train_glm)
##
## Coefficients:
## (Intercept) alcohol fixed_acidity
## -24.7457 0.9700 0.1753
## volatile_acidity citric_acid Log_chlorides
## -3.2462 -1.7627 -0.4797
## Log_total_sulfur_dioxide Log_free_sulfur_dioxide density
## -0.6740 0.4413 17.5474
## pH Log_sulphates
## -0.1818 2.6310
##
## Degrees of Freedom: 1278 Total (i.e. Null); 1268 Residual
## Null Deviance: 1767
## Residual Deviance: 1300 AIC: 1322
summary(modelo_glm)
##
## Call:
## glm(formula = nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
## citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
## Log_free_sulfur_dioxide + density + pH + Log_sulphates, family = binomial,
## data = train_glm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3178 -0.8232 0.3053 0.7861 2.4429
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.74575 73.63188 -0.336 0.73682
## alcohol 0.96996 0.10386 9.340 < 2e-16 ***
## fixed_acidity 0.17526 0.09475 1.850 0.06436 .
## volatile_acidity -3.24621 0.54263 -5.982 2.20e-09 ***
## citric_acid -1.76266 0.60585 -2.909 0.00362 **
## Log_chlorides -0.47972 0.24746 -1.939 0.05256 .
## Log_total_sulfur_dioxide -0.67404 0.17254 -3.907 9.36e-05 ***
## Log_free_sulfur_dioxide 0.44134 0.16690 2.644 0.00818 **
## density 17.54736 74.95795 0.234 0.81491
## pH -0.18179 0.73995 -0.246 0.80593
## Log_sulphates 2.63103 0.39252 6.703 2.04e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1767.4 on 1278 degrees of freedom
## Residual deviance: 1300.2 on 1268 degrees of freedom
## AIC: 1322.2
##
## Number of Fisher Scoring iterations: 4
Como observamos, nos quedamos solo con las variables significativas que relamente afectan a “nota_vino”, y creamos un nuevo modelo exclusivamente con ellas (“Log_sulphates”, “Log_total_sulfur_dioxide”, “volatile_acidity” y “alcohol”). De esta forma simplificamos el modelo, nos quedamos con las varibales realmente importantes para el modelo predictor y creamos el mejor modelo de regresión logística posible para nuestro conjunto de datos.
modelo_glm2 = glm(nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
Log_total_sulfur_dioxide, data = train_glm, family = binomial)
modelo_glm2
##
## Call: glm(formula = nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
## Log_total_sulfur_dioxide, family = binomial, data = train_glm)
##
## Coefficients:
## (Intercept) alcohol volatile_acidity
## -5.9714 0.9694 -2.7563
## Log_sulphates Log_total_sulfur_dioxide
## 2.2830 -0.3912
##
## Degrees of Freedom: 1278 Total (i.e. Null); 1274 Residual
## Null Deviance: 1767
## Residual Deviance: 1329 AIC: 1339
summary(modelo_glm2)
##
## Call:
## glm(formula = nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
## Log_total_sulfur_dioxide, family = binomial, data = train_glm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0757 -0.8344 0.2981 0.8035 2.3837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.97144 0.98206 -6.081 1.20e-09 ***
## alcohol 0.96938 0.07907 12.260 < 2e-16 ***
## volatile_acidity -2.75634 0.41634 -6.620 3.58e-11 ***
## Log_sulphates 2.28302 0.35200 6.486 8.82e-11 ***
## Log_total_sulfur_dioxide -0.39120 0.10198 -3.836 0.000125 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1767.4 on 1278 degrees of freedom
## Residual deviance: 1328.6 on 1274 degrees of freedom
## AIC: 1338.6
##
## Number of Fisher Scoring iterations: 4
Para realizar la interpretación de los coeficientes:
round(exp(cbind(Estimate = coef(modelo_glm2), confint(modelo_glm2))),
2)
## Estimate 2.5 % 97.5 %
## (Intercept) 0.00 0.00 0.02
## alcohol 2.64 2.27 3.09
## volatile_acidity 0.06 0.03 0.14
## Log_sulphates 9.81 4.97 19.79
## Log_total_sulfur_dioxide 0.68 0.55 0.83
Los intervalos de confianza no se basan en un test de Wald (como en regresión tradicional), sino en un perfilado (profiling) de la log-likelihood, que es más preciso.
Predicción de valores del modelo:
head(predict(modelo_glm2))
## 1 2 3 4 5 6
## 0.1559899 -1.5792806 -1.4338215 -0.6054977 -0.6622377 -1.5725449
Probabilidad en escala de la salida:
head(predict(modelo_glm2, type = "response"))
## 1 2 3 4 5 6
## 0.5389186 0.1708974 0.1925039 0.3530869 0.3402371 0.1718539
Evaluación del rendimiento predictivo del modelo GLM presentado con las datos de train:
train_glm$y_pred_probs <- predict(modelo_glm2, train_glm, type = "response")
train_glm$y_pred <- ifelse(train_glm$y_pred_probs > 0.5, 1, 0)
# train_glm$y_pred_probs train_glm$y_pred
cm_train <- confusionMatrix(as.factor(train_glm$y_pred), as.factor(train_glm$nota_vino),
positive = "1")
cm_train$table
## Reference
## Prediction 0 1
## 0 445 172
## 1 152 510
# result
cm_train$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.75
cm_train$byClass["Recall"] %>%
round(2)
## Recall
## 0.75
cm_train$byClass["Precision"] %>%
round(2)
## Precision
## 0.77
Viendo el valor de las metricas obtenidas, el valor de Accuracy (número de predicciones correctas/número total de predicciones) se situa en el 75%, el de Precision (positivos verdaderos/(positivos verdaderos + falsos positivos)) se situa en un 77%, y el de Recall o Sensitividad (positivos verdaderos/(positivos verdaderos/falsos negativos)) en un 75%.
Con estos datos entendemos que con el modelo desarrollado, en alrededor del 75% de los casos este será capaz de predecir si un vino aprueba en nota porque es razonablemente bueno (nota_vino >= 6) o sino suspende porque es realmente malo (nota_vino < 6).
Tratamos de aplicar Cross Validation sobre el modelo de GLM y realizar una selección de hiperparámetros:
Vemos primero cuales son las posibles variables que tienes el modelo para tratar de configurar. Cómo se puede ver, el modelo GLM no tiene la posibilidad de ajustar hiperparámetros.
## https://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/?msclkid=37e9f222aa8711ec9c857e7c4b89d202
## https://daviddalpiaz.github.io/r4sl/the-caret-package.html#classification
# Vemos hiperparámetros que se pueden configurar
modelLookup("glm")
## model parameter label forReg forClass probModel
## 1 glm parameter parameter TRUE TRUE TRUE
Creamos el modelo con las variables seleccionadas como relevantes y haciendo Cross Validation on 5 particiones del dataset de train.
caret.glm <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity + Log_sulphates + Log_total_sulfur_dioxide,
method = "glm",
family = "binomial",
data = train_glm,
trControl = trainControl(method = "cv", number = 5, search = "grid",returnResamp = "final"))
caret.glm
## Generalized Linear Model
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1024, 1023, 1022, 1023
## Resampling results:
##
## Accuracy Kappa
## 0.7482917 0.4956541
summary(caret.glm)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0757 -0.8344 0.2981 0.8035 2.3837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.97144 0.98206 -6.081 1.20e-09 ***
## alcohol 0.96938 0.07907 12.260 < 2e-16 ***
## volatile_acidity -2.75634 0.41634 -6.620 3.58e-11 ***
## Log_sulphates 2.28302 0.35200 6.486 8.82e-11 ***
## Log_total_sulfur_dioxide -0.39120 0.10198 -3.836 0.000125 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1767.4 on 1278 degrees of freedom
## Residual deviance: 1328.6 on 1274 degrees of freedom
## AIC: 1338.6
##
## Number of Fisher Scoring iterations: 4
Con estos datos entendemos que con el modelo desarrollado, en alrededor del 74/75% de los casos este será capaz de predecir si un vino aprueba en nota porque es razonablemente bueno (nota_vino >= 6) o sino suspende porque es realmente malo (nota_vino < 6).
confusionMatrix(caret.glm)
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 35.1 13.6
## 1 11.6 39.7
##
## Accuracy (average) : 0.7482
Evaluación del rendimiento predictivo del modelo GLM presentado con las datos de train:
train_glm$y_pred_probs2 <- predict(caret.glm, train_glm, type = "prob")
train_glm$y_pred_probs2 <- ifelse(train_glm$y_pred_probs2$`1` >
0.5, train_glm$y_pred_probs2$`1`, 1 - train_glm$y_pred_probs2$`0`)
train_glm$y_pred2 <- ifelse(train_glm$y_pred_probs2 > 0.5, 1,
0)
# train_glm$y_pred_probs2 train_glm$y_pred2
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de GLM obtenido:
cm_train2 <- confusionMatrix(as.factor(train_glm$y_pred2), as.factor(train_glm$nota_vino),
positive = "1")
cm_train2$table
## Reference
## Prediction 0 1
## 0 445 172
## 1 152 510
# result
cm_train2$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.75
cm_train2$byClass["Recall"] %>%
round(2)
## Recall
## 0.75
cm_train2$byClass["Precision"] %>%
round(2)
## Precision
## 0.77
Reproducimos la curva ROC sobre el modelo final de GLM obtenido:
roc_glm <- plot.roc(as.numeric(train_glm$nota_vino), as.numeric(train_glm$y_pred_probs2),
col = "blue")
auc(roc_glm)
## Area under the curve: 0.8195
Se obtiene alrededor de un 82% de área bajo la curva.
En nuestro dataset de train, creamos la variable binaria “nota_vino”, para que en función de “quality” nos diga los vinos con calificaciones aprobadas (quality >= 6) o suspensas (quality < 6).
train_knn <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_knn
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_knn$nota_vino)
##
## 0 1
## 597 682
str(train_knn)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# REFERENCIA:https://www.edureka.co/blog/knn-algorithm-in-r/
# train_knn <- train[, colnames(train) != 'quality']
# train_knn$nota_vino <- factor(train$quality < 6, labels =
# c('aprobado','suspenso')) # levels = c('FALSE', 'TRUE')
# train_knn table(train_knn$nota_vino)
# str(train_knn)
Lo primero de todo calculamos el número de observaciones que tiene nuestro dataset en train. Queremos así ver de inicio el número de “K” o vecinos con el que cuenta nuestro conjunto de datos de entrenamiento, para posteriormente y en base a ello aproximar el óptimo valor de “K”.
NROW <- NROW(train_knn)
NROW
## [1] 1279
Para obtener el valor óptimo aproximado de “K” realizamos la raiz cuadrada del número total de observaciones del dataset de train
sqrt(1279)
## [1] 35.76311
Probaremos con 35 y 36 vecinos como una primera aproximación del “k” óptimo en un modelo de knn.
Para tratar de realizar el modelo de knn dividimos nuestros datos de train en train y validación:
numero_total = nrow(train_knn)
w_train = 0.7
w_vali = 0.3
indices = seq(1:numero_total)
indices_train = sample(1:numero_total, numero_total * w_train)
indices_vali = sample(indices[-indices_train], numero_total *
w_vali)
k_train = train_knn[indices_train, ]
k_train
## # A tibble: 895 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 0.44 0.73 0.999 12 3.17
## 2 11.7 0.28 0.47 0.997 10.6 3.15
## 3 10.7 0.4 0.37 0.997 11.2 3.12
## 4 7.9 0.34 0.36 0.994 11.2 3.3
## 5 6.1 0.58 0.23 0.994 12.5 3.46
## 6 11.5 0.59 0.59 0.999 11 3.18
## 7 6.8 0.59 0.1 0.996 9.7 3.3
## 8 7.2 0.53 0.13 0.996 9.9 3.21
## 9 7.7 0.75 0.27 0.997 9.3 3.24
## 10 6.9 0.84 0.21 0.998 9.23 3.53
## # … with 885 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
k_vali = train_knn[indices_vali, ]
Probamos un modelo simple de k vecinos con K = 35 y K = 36:
knn_simple_35 <- knn(k_train[, 1:11], k_vali[, 1:11], k = 35,
cl = as.factor(k_train$nota_vino))
knn_simple_36 <- knn(k_train[, 1:11], k_vali[, 1:11], k = 36,
cl = as.factor(k_train$nota_vino))
table(knn_simple_35, as.factor(k_vali$nota_vino))
##
## knn_simple_35 0 1
## 0 141 58
## 1 42 142
table(knn_simple_36, as.factor(k_vali$nota_vino))
##
## knn_simple_36 0 1
## 0 139 57
## 1 44 143
accuracy_knn_simple_35 = sum(knn_simple_35== k_vali$nota_vino) /nrow(k_vali)
accuracy_knn_simple_36 = sum(knn_simple_36== k_vali$nota_vino) /nrow(k_vali)
error_knn_simple_35 = 1-accuracy_knn_simple_35
error_knn_simple_36 = 1-accuracy_knn_simple_36
accuracy_knn_simple_35
## [1] 0.7389034
error_knn_simple_35
## [1] 0.2610966
print("...........................")
## [1] "..........................."
accuracy_knn_simple_36
## [1] 0.7362924
error_knn_simple_36
## [1] 0.2637076
Vemos que en ambos casos los resultados son muy parecidos obteniendo un accuracy de entorno al 73/74% de precisión.
Habiendo visto el modelo base, tratamos de ir un paso más allá creando otra versión que nos permita por un lado normalizar o estandarizar ls variables para tratarlas con una magnitud equivalente, ajustar de forma más precisa hiperparámetros a través de un número de “k” óptimo que venga dado realizando validación cruzada y evaluar un modelo para ver su precisión, robustez y capacidad de generalización.
modelLookup("knn")
## model parameter label forReg forClass probModel
## 1 knn k #Neighbors TRUE TRUE TRUE
Vemos que el parámetro que podemos ajustar es el valor de “k” que son el número de vecinos más cercanos con los que compararemos las diferentes observaciones y realizaremos la clasificación teniendo en cuena la distancia euclídea entre los puntos.
Tratamos de plantear un modelo de knn que incluya un proceso de validación cruzada de 5 folds, que centre y estándarice la escala de las variables, y que ajuste el hiperparámetro k de vecinos óptimo.
set.seed(22222220)
caret.knn = train(as.factor(nota_vino) ~ ., data = train_knn,
method = "knn", trControl = trainControl(method = "cv", number = 5,
search = "grid", returnResamp = "final"), preProcess = c("center",
"scale"), tuneGrid = expand.grid(k = seq(1, 101, by = 2)))
caret.knn
## k-Nearest Neighbors
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1023, 1023, 1023, 1024
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.7529259 0.5024638
## 3 0.7271293 0.4496676
## 5 0.7451256 0.4857637
## 7 0.7279075 0.4514521
## 9 0.7372702 0.4712542
## 11 0.7497855 0.4961405
## 13 0.7474357 0.4921884
## 15 0.7443045 0.4863425
## 17 0.7419700 0.4817875
## 19 0.7403860 0.4788672
## 21 0.7403922 0.4788520
## 23 0.7482108 0.4949287
## 25 0.7443076 0.4870466
## 27 0.7435202 0.4857680
## 29 0.7404013 0.4793634
## 31 0.7443107 0.4873547
## 33 0.7443168 0.4876106
## 35 0.7466575 0.4922032
## 37 0.7474357 0.4937484
## 39 0.7474387 0.4936762
## 41 0.7435294 0.4852850
## 43 0.7419669 0.4824611
## 45 0.7396201 0.4774906
## 47 0.7458732 0.4903228
## 49 0.7435263 0.4853138
## 51 0.7419669 0.4826133
## 53 0.7505576 0.4996890
## 55 0.7505607 0.4998633
## 57 0.7497794 0.4980862
## 59 0.7536918 0.5055618
## 61 0.7560386 0.5102242
## 63 0.7560417 0.5104059
## 65 0.7583885 0.5153964
## 67 0.7544761 0.5079165
## 69 0.7599510 0.5183972
## 71 0.7599540 0.5181401
## 73 0.7560355 0.5106282
## 75 0.7536918 0.5056467
## 77 0.7576011 0.5138357
## 79 0.7560509 0.5110409
## 81 0.7544761 0.5080859
## 83 0.7552574 0.5095624
## 85 0.7552543 0.5096503
## 87 0.7529136 0.5050209
## 89 0.7576072 0.5145375
## 91 0.7599540 0.5192917
## 93 0.7552635 0.5098494
## 95 0.7552727 0.5096173
## 97 0.7591850 0.5178204
## 99 0.7607445 0.5210021
## 101 0.7583946 0.5161423
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 99.
summary(caret.knn)
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 11 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
Conseguimos un valor de k óptimo en 99 vecinos que nos da un accuracy del 76.09% mejorando los resultados obtenidos con anterioridad.
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.knn)
## k Accuracy Kappa AccuracySD KappaSD
## 1 99 0.7607445 0.5210021 0.02308807 0.04514979
Con estos datos entendemos que con el modelo desarrollado, en alrededor del 76/77% de los casos este será capaz de predecir si un vino aprueba en nota porque es razonablemente bueno (nota_vino >= 6) o sino suspende porque es realmente malo (nota_vino < 6).
Evaluación del rendimiento predictivo del modelo KNN presentado con las datos de train:
train_knn$y_pred_probs2 <- predict(caret.knn, newdata = train_knn,
type = "prob")
train_knn$y_pred_probs2 <- ifelse(train_knn$y_pred_probs2$`1` >
0.5, train_knn$y_pred_probs2$`1`, 1 - train_knn$y_pred_probs2$`0`)
train_knn$y_pred2 <- ifelse(train_knn$y_pred_probs2 > 0.5, 1,
0)
# train_knn$y_pred_probs2 train_knn train_knn$y_pred2
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de KNN obtenido:
cm_train_knn <- confusionMatrix(as.factor(train_knn$y_pred2),
as.factor(train_knn$nota_vino), positive = "1")
cm_train_knn$table
## Reference
## Prediction 0 1
## 0 463 164
## 1 134 518
# result
cm_train_knn$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.77
cm_train_knn$byClass["Recall"] %>%
round(2)
## Recall
## 0.76
cm_train_knn$byClass["Precision"] %>%
round(2)
## Precision
## 0.79
Reproducimos la curva ROC sobre el modelo final de KNN obtenido:
roc_knn <- plot.roc(as.numeric(train_knn$nota_vino), as.numeric(train_knn$y_pred_probs2),
col = "green")
auc(roc_knn)
## Area under the curve: 0.8268
Se obtiene alrededor de un 82/83% de área bajo la curva.
En nuestro dataset de train, creamos la variable binaria “nota_vino”, para que en función de “quality” nos diga los vinos con calificaciones aprobadas (quality >= 6) o suspensas (quality < 6).
# train_tree <- train[, colnames(train)!='quality']
# train_tree$nota_vino <- factor(train$quality < 6, labels
# = c('aprobado', 'suspenso')) train_tree str(train_tree)
train_tree <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_tree
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_tree$nota_vino)
##
## 0 1
## 597 682
str(train_tree)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
Creamos un modelo de árbol de decisión inicial básico y sin podar:
# árbol de clasificación con las opciones por defecto (cp = 0.01 y split = "gini") con el comando:
tree = rpart(as.factor(nota_vino) ~ ., data = train_tree, cp=0.006)
rpart.plot(tree, nn = TRUE, extra = 104, box.palette = "GnBu", branch.lty = 3, shadow.col = "gray")
tree
## n= 1279
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1279 597 1 (0.4667709 0.5332291)
## 2) alcohol< 10.525 779 280 0 (0.6405648 0.3594352)
## 4) Log_sulphates< -0.4700356 494 129 0 (0.7388664 0.2611336) *
## 5) Log_sulphates>=-0.4700356 285 134 1 (0.4701754 0.5298246)
## 10) volatile_acidity>=0.545 103 33 0 (0.6796117 0.3203883)
## 20) Log_chlorides>=-2.333097 36 4 0 (0.8888889 0.1111111) *
## 21) Log_chlorides< -2.333097 67 29 0 (0.5671642 0.4328358)
## 42) fixed_acidity< 9.55 59 21 0 (0.6440678 0.3559322) *
## 43) fixed_acidity>=9.55 8 0 1 (0.0000000 1.0000000) *
## 11) volatile_acidity< 0.545 182 64 1 (0.3516484 0.6483516)
## 22) Log_total_sulfur_dioxide>=4.166635 42 17 0 (0.5952381 0.4047619) *
## 23) Log_total_sulfur_dioxide< 4.166635 140 39 1 (0.2785714 0.7214286) *
## 3) alcohol>=10.525 500 98 1 (0.1960000 0.8040000)
## 6) volatile_acidity>=0.87 24 6 0 (0.7500000 0.2500000) *
## 7) volatile_acidity< 0.87 476 80 1 (0.1680672 0.8319328) *
Analizamos los resultados obtenidos de forma numérica:
rpart.rules(tree, style = "tall")
## as.factor(nota_vino) is 0.11 when
## alcohol < 11
## volatile_acidity >= 0.55
## Log_sulphates >= -0.47
## Log_chlorides >= -2.3
##
## as.factor(nota_vino) is 0.25 when
## alcohol >= 11
## volatile_acidity >= 0.87
##
## as.factor(nota_vino) is 0.26 when
## alcohol < 11
## Log_sulphates < -0.47
##
## as.factor(nota_vino) is 0.36 when
## alcohol < 11
## volatile_acidity >= 0.55
## Log_sulphates >= -0.47
## Log_chlorides < -2.3
## fixed_acidity < 9.6
##
## as.factor(nota_vino) is 0.40 when
## alcohol < 11
## volatile_acidity < 0.55
## Log_sulphates >= -0.47
## Log_total_sulfur_dioxide >= 4.2
##
## as.factor(nota_vino) is 0.72 when
## alcohol < 11
## volatile_acidity < 0.55
## Log_sulphates >= -0.47
## Log_total_sulfur_dioxide < 4.2
##
## as.factor(nota_vino) is 0.83 when
## alcohol >= 11
## volatile_acidity < 0.87
##
## as.factor(nota_vino) is 1.00 when
## alcohol < 11
## volatile_acidity >= 0.55
## Log_sulphates >= -0.47
## Log_chlorides < -2.3
## fixed_acidity >= 9.6
obs_tree1 <- as.factor(train_tree$nota_vino)
head(predict(tree, newdata = train_tree))
## 0 1
## 1 0.7388664 0.2611336
## 2 0.7388664 0.2611336
## 3 0.7388664 0.2611336
## 4 0.5952381 0.4047619
## 5 0.6440678 0.3559322
## 6 0.7388664 0.2611336
pred_tree1 <- predict(tree, newdata = train_tree, type = "class")
table(obs_tree1, pred_tree1)
## pred_tree1
## obs_tree1 0 1
## 0 478 119
## 1 177 505
caret::confusionMatrix(pred_tree1, obs_tree1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 478 177
## 1 119 505
##
## Accuracy : 0.7686
## 95% CI : (0.7445, 0.7914)
## No Information Rate : 0.5332
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5379
##
## Mcnemar's Test P-Value : 0.0009228
##
## Sensitivity : 0.8007
## Specificity : 0.7405
## Pos Pred Value : 0.7298
## Neg Pred Value : 0.8093
## Prevalence : 0.4668
## Detection Rate : 0.3737
## Detection Prevalence : 0.5121
## Balanced Accuracy : 0.7706
##
## 'Positive' Class : 0
##
Obtenemos un valor del 76.86% para la precisión del modelo, con el incoveniente de tener un modelo sin poda, demasiado complejo y que puede tender al sobreajuste.
Realizamos la valoración para una posible poda del modelo que permita simplificarlo y hacerlo más explicativo sin perder capacidad predictora. Para ello vemos el CP o “Parámetro de complejidad” con el cual buscamos el árbol menos profundo que además minimice la tasa de error.
plotcp(tree) #CP - PARÁMETRO DE COMPLEJIDAD: Buscamos el árbol menos profundo que además minimiza la tasa de error
printcp(tree)
##
## Classification tree:
## rpart(formula = as.factor(nota_vino) ~ ., data = train_tree,
## cp = 0.006)
##
## Variables actually used in tree construction:
## [1] alcohol fixed_acidity Log_chlorides
## [4] Log_sulphates Log_total_sulfur_dioxide volatile_acidity
##
## Root node error: 597/1279 = 0.46677
##
## n= 1279
##
## CP nsplit rel error xerror xstd
## 1 0.3668342 0 1.00000 1.00000 0.029886
## 2 0.0452261 1 0.63317 0.65159 0.027559
## 3 0.0201005 3 0.54271 0.58291 0.026660
## 4 0.0134003 4 0.52261 0.56951 0.026464
## 5 0.0067002 5 0.50921 0.56951 0.026464
## 6 0.0060000 7 0.49581 0.56114 0.026339
Finalmente decimos proceder a realizar la poda y crear un modelo alternativo más simplificado:
xerror <- tree$cptable[, "xerror"]
imin.xerror <- which.min(xerror)
upper.xerror <- xerror[imin.xerror] + tree$cptable[imin.xerror,
"xstd"]
icp <- min(which(xerror <= upper.xerror))
cp <- tree$cptable[icp, "CP"]
cp
## [1] 0.0201005
tree_2 <- prune(tree, cp = cp)
# tree summary(tree) caret::varImp(tree) importance <-
# tree$variable.importance importance <-
# round(100*importance/sum(importance), 1)
# importance[importance >= 1]
rpart.plot(tree_2, nn = TRUE, extra = 104, box.palette = "GnBu",
branch.lty = 3, shadow.col = "gray") #, main='Classification tree winetaste'
obs_tree2 <- as.factor(train_tree$nota_vino)
head(predict(tree_2, newdata = train_tree))
## 0 1
## 1 0.7388664 0.2611336
## 2 0.7388664 0.2611336
## 3 0.7388664 0.2611336
## 4 0.3516484 0.6483516
## 5 0.6796117 0.3203883
## 6 0.7388664 0.2611336
pred_tree2 <- predict(tree_2, newdata = train_tree, type = "class")
table(obs_tree2, pred_tree2)
## pred_tree2
## obs_tree2 0 1
## 0 435 162
## 1 162 520
caret::confusionMatrix(pred_tree2, obs_tree2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 435 162
## 1 162 520
##
## Accuracy : 0.7467
## 95% CI : (0.7219, 0.7703)
## No Information Rate : 0.5332
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4911
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7286
## Specificity : 0.7625
## Pos Pred Value : 0.7286
## Neg Pred Value : 0.7625
## Prevalence : 0.4668
## Detection Rate : 0.3401
## Detection Prevalence : 0.4668
## Balanced Accuracy : 0.7456
##
## 'Positive' Class : 0
##
Aplicando la poda a nuestro árbol obtenemos un modelo mas limpio, simple, explicativo y generalizable a otro conjunto de datos, evitando el posible sobreajuste del modelo y solo reduciendo su capacidad predictora a un valor de precisión del 74.67%. Entendemos que este modelo podado será el óptimo en este caso.
De cara a obtener el mejor modelo posible realizaremos validación cruzada de 5 folds y trataremos de ajustar hiperparámetros (el “cp” óptimo para un modelo ya validado). Utilizamos además las variables que hemos vito como más representativas y explicativas de la variable respuesta “nota_vino”.
# Fit the model on the training set
set.seed(1234)
caret.tree <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity +
Log_sulphates + Log_total_sulfur_dioxide, data = train_tree,
method = "rpart", trControl = trainControl("cv", number = 5,
search = "grid", returnResamp = "final"), tuneLength = 20)
# Plot model accuracy vs different values of cp (complexity
# parameter)
plot(caret.tree)
caret.tree
## CART
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1022, 1024, 1023, 1023, 1024
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00000000 0.7239386 0.4455300
## 0.01930706 0.7255101 0.4484141
## 0.03861412 0.7215885 0.4421371
## 0.05792118 0.7004855 0.4069132
## 0.07722825 0.7004855 0.4069132
## 0.09653531 0.7004855 0.4069132
## 0.11584237 0.7004855 0.4069132
## 0.13514943 0.7004855 0.4069132
## 0.15445649 0.7004855 0.4069132
## 0.17376355 0.7004855 0.4069132
## 0.19307062 0.7004855 0.4069132
## 0.21237768 0.7004855 0.4069132
## 0.23168474 0.7004855 0.4069132
## 0.25099180 0.7004855 0.4069132
## 0.27029886 0.7004855 0.4069132
## 0.28960592 0.7004855 0.4069132
## 0.30891299 0.7004855 0.4069132
## 0.32822005 0.7004855 0.4069132
## 0.34752711 0.7004855 0.4069132
## 0.36683417 0.6614230 0.3137864
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01930706.
caret.tree$bestTune
## cp
## 2 0.01930706
Realizando la validación cruzada vemos que el CP óptimo para nuestro modelo de árbol de decisión se encuentra en 0.01930706.
Visualizamos graficamente el árbol obtenido:
# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(caret.tree$finalModel,uniform=TRUE)
text(caret.tree$finalModel, digits = 10)
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.tree)
## cp Accuracy Kappa AccuracySD KappaSD
## 1 0.01930706 0.7255101 0.4484141 0.02542086 0.05203991
Obtenemos finalmente haciendo validación cruzada una precisión del 72/73%, con un modelo que ha sido comprobado como robusto y generalizable para funcionar previsiblemente en otro conjunto de datos diferente.
Evaluación del rendimiento predictivo del modelo Decision Tree presentado con las datos de train:
train_tree$y_pred_probs2 <- predict(caret.tree, newdata = train_tree,
type = "prob")
train_tree$y_pred_probs2 <- ifelse(train_tree$y_pred_probs2$`1` >
0.5, train_tree$y_pred_probs2$`1`, 1 - train_tree$y_pred_probs2$`0`)
train_tree$y_pred2 <- ifelse(train_tree$y_pred_probs2 > 0.5,
1, 0)
# train_tree$y_pred_probs2 train_tree$y_pred2
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de Decision Tree obtenido:
cm_train_tree <- confusionMatrix(as.factor(train_tree$y_pred2),
as.factor(train_tree$nota_vino), positive = "1")
cm_train_tree$table
## Reference
## Prediction 0 1
## 0 453 168
## 1 144 514
# result
cm_train_tree$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.76
cm_train_tree$byClass["Recall"] %>%
round(2)
## Recall
## 0.75
cm_train_tree$byClass["Precision"] %>%
round(2)
## Precision
## 0.78
Reproducimos la curva ROC sobre el modelo final de Decision Tree obtenido:
roc_tree <- plot.roc(as.numeric(train_tree$nota_vino), as.numeric(train_tree$y_pred_probs2),
col = "yellow")
auc(roc_tree)
## Area under the curve: 0.7798
Se obtiene alrededor de un 78% de área bajo la curva.
En nuestro dataset de train, creamos la variable binaria “nota_vino”, para que en función de “quality” nos diga los vinos con calificaciones aprobadas (quality >= 6) o suspensas (quality < 6).
train_forest <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_forest
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_forest$nota_vino)
##
## 0 1
## 597 682
str(train_forest)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# train_forest <- train[, colnames(train)!='quality']
# train_forest$nota_vino <- factor(train$quality < 6,
# labels = c('aprobado', 'suspenso')) train_forest
# str(train_forest)
Creamos el modelo base de bosque de árboles:
Examinamos la convergencia del error en las muestras:
plot(rf,main="")
legend("right", colnames(rf$err.rate), lty = 1:5, col = 1:6)
Vemos la relevancia de las variables en el modelo (vemos que la variable clave que más afecta al accuracy del modelo es “alcohol”)
varImpPlot(rf)
Vemos que el principal pa´rametro a configurar es el número de predictores al azar que toma el modelo.
modelLookup("rf")
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
Creamos un modelo aplicando la validación cruzada y ajustando hiperparámetros (mtry, número de árboles y el tamaño de los nodos para regular su profundidad) de tal forma que creemos un modelo robusto y generalizable. Tomamos como base las 4 variable de mayor relevancia que hemos observado:
# Fit the model on the training set
set.seed(12345)
caret.rf <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity +
Log_sulphates + Log_total_sulfur_dioxide, data = train_forest,
method = "rf", ntree = 100, importance = TRUE, metric = "Accuracy",
trControl = trainControl("cv", number = 5, search = "grid",
returnResamp = "final"), nodesize = 30, tuneLength = 10)
## note: only 3 unique complexity parameters in default grid. Truncating the grid to 3 .
plot(caret.rf)
caret.rf
## Random Forest
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1022, 1023, 1024, 1024
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7646311 0.5276577
## 3 0.7622874 0.5232098
## 4 0.7552530 0.5094700
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
caret.rf$bestTune
## mtry
## 1 2
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.rf)
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7646311 0.5276577 0.01880414 0.03791814
Evaluación del rendimiento predictivo del modelo Random Forest presentado con las datos de train:
train_forest$y_pred_probs2 <- predict(caret.rf, newdata = train_forest,
type = "prob")
train_forest$y_pred_probs2 <- ifelse(train_forest$y_pred_probs2$`1` >
0.5, train_forest$y_pred_probs2$`1`, 1 - train_forest$y_pred_probs2$`0`)
train_forest$y_pred2 <- ifelse(train_forest$y_pred_probs2 > 0.5,
1, 0)
# train_forest$y_pred_probs2 train_forest$y_pred2
# train_forest
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de Random Forest obtenido:
cm_train_forest <- confusionMatrix(as.factor(train_forest$y_pred2),
as.factor(train_forest$nota_vino), positive = "1")
cm_train_forest$table
## Reference
## Prediction 0 1
## 0 501 121
## 1 96 561
# result
cm_train_forest$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.83
cm_train_forest$byClass["Recall"] %>%
round(2)
## Recall
## 0.82
cm_train_forest$byClass["Precision"] %>%
round(2)
## Precision
## 0.85
Reproducimos la curva ROC sobre el modelo final de Random Forest obtenido:
roc_rf <- plot.roc(as.numeric(train_forest$nota_vino), as.numeric(train_forest$y_pred_probs2),
col = "red")
auc(roc_rf)
## Area under the curve: 0.926
En nuestro dataset de train, creamos la variable binaria “nota_vino”, para que en función de “quality” nos diga los vinos con calificaciones aprobadas (quality >= 6) o suspensas (quality < 6).
# https://rubenfcasal.github.io/aprendizaje_estadistico/boosting-en-r.html
train_en <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_en
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_en$nota_vino)
##
## 0 1
## 597 682
str(train_en)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# train_en <- train[, colnames(train)!='quality']
# train_en$nota_vino <- factor(train$quality < 6, labels =
# c('aprobado', 'suspenso')) # levels = c('FALSE', 'TRUE')
# str(train_en)
Creamos el modelo de boosting con una configuración inicial básica de parámetros:
ada.boost <- ada(as.factor(nota_vino) ~ ., data = train_en, type = "real",
control = rpart.control(maxdepth = 2, cp = 0, minsplit = 10, xval = 0),
iter = 150, nu = 0.05)
ada.boost
## Call:
## ada(as.factor(nota_vino) ~ ., data = train_en, type = "real",
## control = rpart.control(maxdepth = 2, cp = 0, minsplit = 10,
## xval = 0), iter = 150, nu = 0.05)
##
## Loss: exponential Method: real Iteration: 150
##
## Final Confusion Matrix for Data:
## Final Prediction
## True value 0 1
## 0 468 129
## 1 150 532
##
## Train Error: 0.218
##
## Out-Of-Bag Error: 0.224 iteration= 148
##
## Additional Estimates of number of iterations:
##
## train.err1 train.kap1
## 119 119
Vemos la evolución decreciente del error al aumentar el número de iteraciones en el modelo
plot(ada.boost)
Evaluamos la precisión del modelo en la muestra de train:
set.seed(123)
pred_ada <- predict(ada.boost, newdata = train_en)
caret::confusionMatrix(pred_ada, as.factor(train_en$nota_vino), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 468 150
## 1 129 532
##
## Accuracy : 0.7819
## 95% CI : (0.7582, 0.8042)
## No Information Rate : 0.5332
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5627
##
## Mcnemar's Test P-Value : 0.2312
##
## Sensitivity : 0.7801
## Specificity : 0.7839
## Pos Pred Value : 0.8048
## Neg Pred Value : 0.7573
## Prevalence : 0.5332
## Detection Rate : 0.4159
## Detection Prevalence : 0.5168
## Balanced Accuracy : 0.7820
##
## 'Positive' Class : 1
##
Con la configuración de parámetros realizada en el modelo ada de booting obtenemos un valor de accuracy del 78/79% para el caso de algoritmos de clasificación.
Para optimizar los resultados del modelo creado y la generalización del modelo, se puede realizar un ajuste de hiperparámetros y validación cruzada:
modelLookup("ada")
## model parameter label forReg forClass probModel
## 1 ada iter #Trees FALSE TRUE TRUE
## 2 ada maxdepth Max Tree Depth FALSE TRUE TRUE
## 3 ada nu Learning Rate FALSE TRUE TRUE
Vemos los parámetros de “iter”, “maxdepth” y “nu” que tiene el modelo ada de boosting para árboles de decisión en problemas de clasificación.
set.seed(123)
caret.ada <- train(as.factor(nota_vino) ~ ., method = "ada", data = train_en,
trControl = trainControl(method = "cv", number = 5, search = "grid",returnResamp = "final"))
caret.ada
## Boosted Classification Trees
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1023, 1022, 1024, 1023
## Resampling results across tuning parameters:
##
## maxdepth iter Accuracy Kappa
## 1 50 0.7443213 0.4890732
## 1 100 0.7584084 0.5160080
## 1 150 0.7576149 0.5133790
## 2 50 0.7568276 0.5121128
## 2 100 0.7529335 0.5043760
## 2 150 0.7591927 0.5167245
## 3 50 0.7599647 0.5189156
## 3 100 0.7607643 0.5206920
## 3 150 0.7568488 0.5125098
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 100, maxdepth = 3 and nu = 0.1.
Obtenemos una configuración óptima de los hiperparámetros del modelo en “iter” = 100, “maxdepth” = 3 y “nu” = 0.1.
caret.ada
## Boosted Classification Trees
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1023, 1022, 1024, 1023
## Resampling results across tuning parameters:
##
## maxdepth iter Accuracy Kappa
## 1 50 0.7443213 0.4890732
## 1 100 0.7584084 0.5160080
## 1 150 0.7576149 0.5133790
## 2 50 0.7568276 0.5121128
## 2 100 0.7529335 0.5043760
## 2 150 0.7591927 0.5167245
## 3 50 0.7599647 0.5189156
## 3 100 0.7607643 0.5206920
## 3 150 0.7568488 0.5125098
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 100, maxdepth = 3 and nu = 0.1.
caret.ada$bestTune
## iter maxdepth nu
## 8 100 3 0.1
Con el modelo de base obtenemos un accuracy del 80.3% con los datos de train.
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.ada)
## nu maxdepth iter Accuracy Kappa AccuracySD KappaSD
## 1 0.1 3 100 0.7607643 0.520692 0.02106294 0.04217624
Evaluación del rendimiento predictivo del modelo Ada Boost presentado con las datos de train:
train_en$y_pred_probs2 <- predict(caret.ada, newdata = train_en,
type = "prob")
train_en$y_pred_probs2 <- ifelse(train_en$y_pred_probs2$`1` >
0.5, train_en$y_pred_probs2$`1`, 1 - train_en$y_pred_probs2$`0`)
train_en$y_pred2 <- ifelse(train_en$y_pred_probs2 > 0.5, 1, 0)
# train_forest$y_pred_probs2 train_en$y_pred2 train_en
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de Ada Boost obtenido:
cm_train_en <- confusionMatrix(as.factor(train_en$y_pred2), as.factor(train_en$nota_vino),
positive = "1")
cm_train_en$table
## Reference
## Prediction 0 1
## 0 479 135
## 1 118 547
# result
cm_train_en$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.8
cm_train_en$byClass["Recall"] %>%
round(2)
## Recall
## 0.8
cm_train_en$byClass["Precision"] %>%
round(2)
## Precision
## 0.82
Reproducimos la curva ROC sobre el modelo final de Ada Boost obtenido:
roc_ada <- plot.roc(as.numeric(train_en$nota_vino), as.numeric(train_en$y_pred_probs2),
col = "orange")
auc(roc_ada)
## Area under the curve: 0.8912
En nuestro dataset de train, creamos la variable binaria “nota_vino”, para que en función de “quality” nos diga los vinos con calificaciones aprobadas (quality >= 6) o suspensas (quality < 6).
train_xgb <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_xgb
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_xgb$nota_vino)
##
## 0 1
## 597 682
str(train_xgb)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# train_xgb <- train[, colnames(train)!='quality']
# train_xgb$nota_vino <- factor(train$quality < 6, labels =
# c('aprobado', 'suspenso')) # levels = c('FALSE', 'TRUE')
# str(train_xgb)
Para optimizar los resultados del modelo creado, se puede realizar un ajuste de hiperparámetros con validación cruzada:
modelLookup("xgbTree")
## model parameter label forReg forClass
## 1 xgbTree nrounds # Boosting Iterations TRUE TRUE
## 2 xgbTree max_depth Max Tree Depth TRUE TRUE
## 3 xgbTree eta Shrinkage TRUE TRUE
## 4 xgbTree gamma Minimum Loss Reduction TRUE TRUE
## 5 xgbTree colsample_bytree Subsample Ratio of Columns TRUE TRUE
## 6 xgbTree min_child_weight Minimum Sum of Instance Weight TRUE TRUE
## 7 xgbTree subsample Subsample Percentage TRUE TRUE
## probModel
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
## 7 TRUE
Creamos el modelo de boosting con una configuración inicial dada de parámetros:
Obtenemos una configuración óptima de los hiperparámetros del modelo en:
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.xgb)
## eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 1 0.3 3 0 0.6 1 1 150
## Accuracy Kappa AccuracySD KappaSD
## 1 0.7967279 0.5917263 0.0189888 0.03670645
Vemos la relevancia de cada variable en el modelo:
varImp(caret.xgb)
## xgbTree variable importance
##
## Overall
## alcohol 100.000
## Log_sulphates 65.538
## volatile_acidity 52.167
## density 39.356
## Log_total_sulfur_dioxide 39.330
## Log_chlorides 27.923
## pH 12.213
## citric_acid 4.803
## Log_free_sulfur_dioxide 4.607
## fixed_acidity 2.691
## Log_residual_sugar 0.000
Evaluación del rendimiento predictivo del modelo XG Boost presentado con las datos de train:
train_xgb$y_pred_probs2 <- predict(caret.xgb, newdata = train_xgb,
type = "prob")
train_xgb$y_pred_probs2 <- ifelse(train_xgb$y_pred_probs2$`1` >
0.5, train_xgb$y_pred_probs2$`1`, 1 - train_xgb$y_pred_probs2$`0`)
train_xgb$y_pred2 <- ifelse(train_xgb$y_pred_probs2 > 0.5, 1,
0)
# train_xgb$y_pred_probs2 train_xgb$y_pred2 train_xgb
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de XG Boost obtenido:
cm_train_xgb <- confusionMatrix(as.factor(train_xgb$y_pred2),
as.factor(train_xgb$nota_vino), positive = "1")
cm_train_xgb$table
## Reference
## Prediction 0 1
## 0 584 15
## 1 13 667
# result
cm_train_xgb$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.98
cm_train_xgb$byClass["Recall"] %>%
round(2)
## Recall
## 0.98
cm_train_xgb$byClass["Precision"] %>%
round(2)
## Precision
## 0.98
Reproducimos la curva ROC sobre el modelo final de XG Boost obtenido:
roc_xgb <- plot.roc(as.numeric(train_xgb$nota_vino), as.numeric(train_xgb$y_pred_probs2),
col = "purple")
auc(roc_xgb)
## Area under the curve: 0.9976
En nuestro dataset de train y test, hemos creado la variable binaria “nota_vino”para que, en función de “quality,nos diga los vinos con calificaciones aprobadas (quality >= 6) o suspensas (quality < 6).
#train_svm <- train[, colnames(train)!="quality"]
#train_svm$nota_vino <- factor(train$quality < 6, labels = c('aprobado', 'suspenso')) # levels = c('FALSE', 'TRUE')
#train_svm
train_svm <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_svm
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_svm$nota_vino)
##
## 0 1
## 597 682
str(train_svm)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
modelLookup('svmLinear')
## model parameter label forReg forClass probModel
## 1 svmLinear C Cost TRUE TRUE TRUE
set.seed(5555)
# Fit the model
caret.svm <- train(as.factor(nota_vino) ~., data = train_svm,
method = "svmLinear",
trControl = trainControl("cv", number = 5, search = "grid",returnResamp = "final"),
preProcess = c("center","scale"),
tuneGrid = expand.grid(C = seq(0.1, 2, length = 20)))
#View the model
caret.svm
## Support Vector Machines with Linear Kernel
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1024, 1022, 1024, 1023
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.1 0.7521827 0.5055023
## 0.2 0.7545265 0.5098180
## 0.3 0.7545295 0.5098578
## 0.4 0.7537452 0.5082373
## 0.5 0.7529609 0.5066244
## 0.6 0.7545234 0.5097268
## 0.7 0.7545234 0.5097268
## 0.8 0.7545234 0.5097268
## 0.9 0.7560890 0.5129537
## 1.0 0.7560890 0.5129537
## 1.1 0.7560890 0.5129537
## 1.2 0.7553047 0.5114252
## 1.3 0.7553047 0.5114252
## 1.4 0.7553047 0.5114252
## 1.5 0.7553047 0.5114252
## 1.6 0.7553047 0.5114252
## 1.7 0.7553047 0.5114252
## 1.8 0.7553047 0.5114252
## 1.9 0.7545204 0.5098984
## 2.0 0.7553047 0.5114252
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.9.
plot(caret.svm)
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.svm)
## C Accuracy Kappa AccuracySD KappaSD
## 1 0.9 0.756089 0.5129537 0.03569907 0.06917993
train_svm$y_pred_probs2 <- predict(caret.svm, newdata = train_svm)
# train_svm$y_pred_probs2 <-
# ifelse(train_svm$y_pred_probs2$`1` > 0.5,
# train_svm$y_pred_probs2$`1`,
# 1-train_svm$y_pred_probs2$`0`)
# train_svm$y_pred2 <- ifelse(train_svm$y_pred_probs2 >
# 0.5, 1, 0)
# train_svm$y_pred_probs2 train_svm train_svm$y_pred2
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de SVM obtenido:
cm_train_svm <- confusionMatrix(as.factor(train_svm$y_pred_probs2),
as.factor(train_svm$nota_vino), positive = "1")
cm_train_svm$table
## Reference
## Prediction 0 1
## 0 465 181
## 1 132 501
# result
cm_train_svm$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.76
cm_train_svm$byClass["Recall"] %>%
round(2)
## Recall
## 0.73
cm_train_svm$byClass["Precision"] %>%
round(2)
## Precision
## 0.79
Reproducimos la curva ROC sobre el modelo final SVM obtenido:
roc_svm <- plot.roc(as.numeric(train_svm$nota_vino), as.numeric(train_svm$y_pred_probs2),
col = "aquamarine3")
auc(roc_svm)
## Area under the curve: 0.7567
Se obtiene alrededor de un 75/76% de área bajo la curva.
modelLookup('svmRadial')
## model parameter label forReg forClass probModel
## 1 svmRadial sigma Sigma TRUE TRUE TRUE
## 2 svmRadial C Cost TRUE TRUE TRUE
set.seed(6666)
# Fit the model
hiperparametros <- expand.grid(sigma = seq(0.001, 1,length = 20),
C = seq(0.1, 2,length = 20))
caret.svm.radial <- train(as.factor(nota_vino) ~., data = train_svm,
method = "svmRadial",
trControl = trainControl("cv", number = 5, search = "grid", returnResamp = "final"),tuneGrid = hiperparametros,
preProcess = c("center","scale"),
tuneLength = 20)
#View the model
caret.svm.radial
## Support Vector Machines with Radial Basis Function Kernel
##
## 1279 samples
## 12 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (12), scaled (12)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1024, 1023, 1023, 1023
## Resampling results across tuning parameters:
##
## sigma C Accuracy Kappa
## 0.00100000 0.1 0.5332292 0.000000000
## 0.00100000 0.2 0.6974326 0.378468426
## 0.00100000 0.3 0.7560784 0.512360411
## 0.00100000 0.4 0.7568597 0.514278875
## 0.00100000 0.5 0.7552972 0.511062060
## 0.00100000 0.6 0.7552972 0.511062060
## 0.00100000 0.7 0.7552972 0.511062060
## 0.00100000 0.8 0.7552972 0.511062060
## 0.00100000 0.9 0.7552972 0.511062060
## 0.00100000 1.0 0.7552972 0.511062060
## 0.00100000 1.1 0.7552972 0.511062060
## 0.00100000 1.2 0.7552972 0.511062060
## 0.00100000 1.3 0.7552972 0.511062060
## 0.00100000 1.4 0.7552972 0.511062060
## 0.00100000 1.5 0.7552972 0.511062060
## 0.00100000 1.6 0.7552972 0.511062060
## 0.00100000 1.7 0.7552972 0.511062060
## 0.00100000 1.8 0.7552972 0.511062060
## 0.00100000 1.9 0.7552972 0.511062060
## 0.00100000 2.0 0.7552972 0.511062060
## 0.05357895 0.1 0.7537286 0.507898628
## 0.05357895 0.2 0.7537286 0.507989008
## 0.05357895 0.3 0.7537286 0.507989008
## 0.05357895 0.4 0.7537286 0.507989008
## 0.05357895 0.5 0.7537255 0.507970820
## 0.05357895 0.6 0.7545098 0.509591793
## 0.05357895 0.7 0.7545098 0.509590596
## 0.05357895 0.8 0.7545098 0.509590596
## 0.05357895 0.9 0.7552911 0.511200944
## 0.05357895 1.0 0.7552880 0.511285990
## 0.05357895 1.1 0.7552880 0.511383027
## 0.05357895 1.2 0.7552880 0.511479861
## 0.05357895 1.3 0.7552849 0.511568018
## 0.05357895 1.4 0.7568505 0.514797442
## 0.05357895 1.5 0.7607598 0.522666658
## 0.05357895 1.6 0.7591912 0.519531862
## 0.05357895 1.7 0.7599755 0.521050919
## 0.05357895 1.8 0.7607567 0.522744778
## 0.05357895 1.9 0.7646599 0.530664338
## 0.05357895 2.0 0.7646599 0.530862996
## 0.10615789 0.1 0.7584283 0.516922226
## 0.10615789 0.2 0.7576409 0.515765398
## 0.10615789 0.3 0.7576409 0.515778715
## 0.10615789 0.4 0.7607721 0.522141386
## 0.10615789 0.5 0.7623254 0.525311332
## 0.10615789 0.6 0.7654473 0.531824396
## 0.10615789 0.7 0.7646630 0.530297611
## 0.10615789 0.8 0.7638756 0.528755397
## 0.10615789 0.9 0.7677849 0.536617951
## 0.10615789 1.0 0.7677849 0.536617951
## 0.10615789 1.1 0.7670006 0.535196193
## 0.10615789 1.2 0.7677819 0.536903972
## 0.10615789 1.3 0.7646538 0.530677748
## 0.10615789 1.4 0.7654350 0.532282564
## 0.10615789 1.5 0.7662194 0.533900684
## 0.10615789 1.6 0.7654350 0.532283358
## 0.10615789 1.7 0.7646538 0.530681931
## 0.10615789 1.8 0.7630913 0.527571213
## 0.10615789 1.9 0.7630913 0.527571213
## 0.10615789 2.0 0.7638725 0.529200915
## 0.15873684 0.1 0.7505974 0.499567637
## 0.15873684 0.2 0.7599877 0.520002282
## 0.15873684 0.3 0.7646783 0.529894353
## 0.15873684 0.4 0.7670098 0.534746804
## 0.15873684 0.5 0.7670067 0.534833415
## 0.15873684 0.6 0.7670037 0.534817001
## 0.15873684 0.7 0.7662194 0.533390456
## 0.15873684 0.8 0.7677880 0.536639087
## 0.15873684 0.9 0.7654412 0.532226601
## 0.15873684 1.0 0.7654412 0.532226601
## 0.15873684 1.1 0.7630944 0.527494241
## 0.15873684 1.2 0.7615257 0.524454776
## 0.15873684 1.3 0.7607475 0.522752541
## 0.15873684 1.4 0.7623162 0.525804947
## 0.15873684 1.5 0.7576287 0.516514113
## 0.15873684 1.6 0.7607567 0.522863569
## 0.15873684 1.7 0.7591912 0.519633908
## 0.15873684 1.8 0.7591881 0.519670093
## 0.15873684 1.9 0.7599694 0.521083886
## 0.15873684 2.0 0.7584069 0.518192467
## 0.21131579 0.1 0.7466881 0.490281688
## 0.21131579 0.2 0.7576348 0.514439857
## 0.21131579 0.3 0.7709252 0.542159541
## 0.21131579 0.4 0.7670129 0.534475869
## 0.21131579 0.5 0.7662255 0.533131813
## 0.21131579 0.6 0.7646599 0.530201072
## 0.21131579 0.7 0.7646569 0.530505849
## 0.21131579 0.8 0.7638787 0.528907467
## 0.21131579 0.9 0.7630944 0.527307584
## 0.21131579 1.0 0.7630944 0.527292735
## 0.21131579 1.1 0.7615288 0.524169593
## 0.21131579 1.2 0.7591820 0.519542239
## 0.21131579 1.3 0.7623039 0.525673419
## 0.21131579 1.4 0.7638725 0.528767016
## 0.21131579 1.5 0.7607475 0.522386590
## 0.21131579 1.6 0.7607537 0.522288420
## 0.21131579 1.7 0.7615380 0.523699048
## 0.21131579 1.8 0.7630974 0.526893275
## 0.21131579 1.9 0.7599755 0.520537600
## 0.21131579 2.0 0.7584099 0.517498112
## 0.26389474 0.1 0.7318229 0.458232757
## 0.26389474 0.2 0.7560692 0.510182963
## 0.26389474 0.3 0.7670159 0.533636801
## 0.26389474 0.4 0.7662347 0.532304082
## 0.26389474 0.5 0.7638848 0.528214836
## 0.26389474 0.6 0.7631005 0.526913006
## 0.26389474 0.7 0.7623192 0.525214492
## 0.26389474 0.8 0.7599663 0.520544210
## 0.26389474 0.9 0.7591789 0.518811916
## 0.26389474 1.0 0.7615227 0.523627919
## 0.26389474 1.1 0.7607414 0.522187582
## 0.26389474 1.2 0.7607445 0.521996766
## 0.26389474 1.3 0.7615257 0.523633230
## 0.26389474 1.4 0.7607475 0.521709159
## 0.26389474 1.5 0.7623131 0.524587749
## 0.26389474 1.6 0.7615349 0.523030693
## 0.26389474 1.7 0.7568413 0.513622858
## 0.26389474 1.8 0.7544975 0.508893792
## 0.26389474 1.9 0.7521538 0.504143540
## 0.26389474 2.0 0.7513756 0.502547463
## 0.31647368 0.1 0.7122672 0.416029746
## 0.31647368 0.2 0.7419945 0.480280059
## 0.31647368 0.3 0.7591850 0.516784557
## 0.31647368 0.4 0.7662255 0.531853046
## 0.31647368 0.5 0.7591850 0.518019515
## 0.31647368 0.6 0.7591820 0.518311559
## 0.31647368 0.7 0.7591820 0.518207480
## 0.31647368 0.8 0.7591850 0.518417621
## 0.31647368 0.9 0.7615257 0.522986771
## 0.31647368 1.0 0.7607475 0.521456912
## 0.31647368 1.1 0.7623131 0.524298118
## 0.31647368 1.2 0.7615257 0.522556445
## 0.31647368 1.3 0.7583946 0.516248675
## 0.31647368 1.4 0.7560539 0.511599911
## 0.31647368 1.5 0.7537132 0.506818185
## 0.31647368 1.6 0.7521477 0.503685812
## 0.31647368 1.7 0.7544945 0.508323838
## 0.31647368 1.8 0.7552757 0.509747706
## 0.31647368 1.9 0.7552757 0.509522360
## 0.31647368 2.0 0.7552757 0.509640395
## 0.36905263 0.1 0.6864614 0.359986896
## 0.36905263 0.2 0.7396415 0.474360312
## 0.36905263 0.3 0.7513725 0.499977518
## 0.36905263 0.4 0.7584007 0.515244024
## 0.36905263 0.5 0.7576134 0.514237252
## 0.36905263 0.6 0.7568352 0.512846891
## 0.36905263 0.7 0.7552696 0.509801109
## 0.36905263 0.8 0.7560478 0.511241196
## 0.36905263 0.9 0.7591759 0.517394819
## 0.36905263 1.0 0.7607384 0.520544399
## 0.36905263 1.1 0.7599602 0.518829704
## 0.36905263 1.2 0.7623009 0.523309259
## 0.36905263 1.3 0.7560539 0.510907624
## 0.36905263 1.4 0.7529289 0.504777706
## 0.36905263 1.5 0.7521415 0.503046663
## 0.36905263 1.6 0.7505760 0.499683402
## 0.36905263 1.7 0.7521415 0.502938299
## 0.36905263 1.8 0.7544884 0.507493188
## 0.36905263 1.9 0.7505760 0.499367140
## 0.36905263 2.0 0.7482261 0.494489860
## 0.42163158 0.1 0.6606526 0.302157018
## 0.42163158 0.2 0.7177420 0.427804116
## 0.42163158 0.3 0.7482384 0.492608054
## 0.42163158 0.4 0.7552788 0.508113567
## 0.42163158 0.5 0.7552757 0.508604062
## 0.42163158 0.6 0.7576134 0.513814832
## 0.42163158 0.7 0.7576134 0.513550943
## 0.42163158 0.8 0.7591759 0.516666118
## 0.42163158 0.9 0.7552635 0.508891051
## 0.42163158 1.0 0.7576103 0.513428602
## 0.42163158 1.1 0.7599571 0.518278636
## 0.42163158 1.2 0.7560539 0.510690496
## 0.42163158 1.3 0.7529289 0.504266366
## 0.42163158 1.4 0.7513634 0.501027597
## 0.42163158 1.5 0.7505760 0.499420162
## 0.42163158 1.6 0.7497947 0.497681713
## 0.42163158 1.7 0.7505760 0.499210749
## 0.42163158 1.8 0.7513572 0.500855272
## 0.42163158 1.9 0.7466605 0.491182035
## 0.42163158 2.0 0.7450919 0.488053808
## 0.47421053 0.1 0.6332996 0.240585208
## 0.47421053 0.2 0.7013235 0.391464592
## 0.47421053 0.3 0.7388511 0.471995562
## 0.47421053 0.4 0.7490227 0.494622340
## 0.47421053 0.5 0.7505882 0.498267856
## 0.47421053 0.6 0.7552727 0.507637837
## 0.47421053 0.7 0.7599540 0.517693558
## 0.47421053 0.8 0.7583977 0.514353209
## 0.47421053 0.9 0.7591850 0.516196229
## 0.47421053 1.0 0.7583946 0.514609100
## 0.47421053 1.1 0.7536979 0.505251730
## 0.47421053 1.2 0.7505790 0.499054648
## 0.47421053 1.3 0.7521385 0.502191955
## 0.47421053 1.4 0.7505760 0.498860735
## 0.47421053 1.5 0.7513572 0.500370672
## 0.47421053 1.6 0.7513603 0.500442875
## 0.47421053 1.7 0.7497947 0.497179301
## 0.47421053 1.8 0.7466575 0.490867724
## 0.47421053 1.9 0.7458762 0.489120565
## 0.47421053 2.0 0.7435294 0.484568065
## 0.52678947 0.1 0.6098560 0.185644545
## 0.52678947 0.2 0.6802022 0.344537965
## 0.52678947 0.3 0.7239951 0.439918627
## 0.52678947 0.4 0.7427574 0.480383247
## 0.52678947 0.5 0.7505821 0.497202720
## 0.52678947 0.6 0.7513634 0.498945124
## 0.52678947 0.7 0.7568321 0.510387255
## 0.52678947 0.8 0.7568321 0.510362205
## 0.52678947 0.9 0.7568321 0.510719444
## 0.52678947 1.0 0.7505729 0.498113828
## 0.52678947 1.1 0.7537071 0.504456001
## 0.52678947 1.2 0.7560539 0.509169387
## 0.52678947 1.3 0.7544853 0.506145579
## 0.52678947 1.4 0.7552665 0.507674127
## 0.52678947 1.5 0.7552696 0.507796920
## 0.52678947 1.6 0.7529228 0.502909268
## 0.52678947 1.7 0.7505699 0.498111100
## 0.52678947 1.8 0.7482261 0.493333771
## 0.52678947 1.9 0.7466575 0.490271278
## 0.52678947 2.0 0.7450950 0.487128366
## 0.57936842 0.1 0.5840349 0.124333927
## 0.57936842 0.2 0.6614369 0.302667989
## 0.57936842 0.3 0.7083548 0.405440622
## 0.57936842 0.4 0.7364982 0.466417862
## 0.57936842 0.5 0.7451072 0.484906790
## 0.57936842 0.6 0.7490135 0.493280155
## 0.57936842 0.7 0.7529228 0.501468968
## 0.57936842 0.8 0.7537102 0.503482087
## 0.57936842 0.9 0.7544853 0.505051800
## 0.57936842 1.0 0.7513634 0.498769210
## 0.57936842 1.1 0.7497978 0.496020010
## 0.57936842 1.2 0.7482322 0.492979345
## 0.57936842 1.3 0.7513572 0.499518537
## 0.57936842 1.4 0.7529228 0.502677676
## 0.57936842 1.5 0.7513572 0.499532909
## 0.57936842 1.6 0.7497917 0.496270170
## 0.57936842 1.7 0.7474449 0.491464180
## 0.57936842 1.8 0.7458732 0.488276513
## 0.57936842 1.9 0.7443076 0.485002146
## 0.57936842 2.0 0.7474357 0.491527502
## 0.63194737 0.1 0.5676042 0.082289405
## 0.63194737 0.2 0.6372120 0.248788850
## 0.63194737 0.3 0.6942678 0.373821496
## 0.63194737 0.4 0.7286826 0.448603211
## 0.63194737 0.5 0.7365074 0.466483913
## 0.63194737 0.6 0.7490165 0.492130357
## 0.63194737 0.7 0.7521415 0.499283992
## 0.63194737 0.8 0.7498009 0.494708439
## 0.63194737 0.9 0.7529289 0.501544653
## 0.63194737 1.0 0.7537071 0.502902762
## 0.63194737 1.1 0.7529289 0.501867408
## 0.63194737 1.2 0.7513603 0.499018350
## 0.63194737 1.3 0.7505790 0.497602994
## 0.63194737 1.4 0.7490135 0.494231131
## 0.63194737 1.5 0.7466605 0.489402649
## 0.63194737 1.6 0.7443107 0.484688914
## 0.63194737 1.7 0.7450888 0.486083319
## 0.63194737 1.8 0.7419577 0.479508963
## 0.63194737 1.9 0.7403983 0.476291137
## 0.63194737 2.0 0.7403922 0.476359585
## 0.68452632 0.1 0.5511949 0.043324004
## 0.68452632 0.2 0.6239277 0.217375834
## 0.68452632 0.3 0.6692616 0.319228091
## 0.68452632 0.4 0.7138143 0.415765797
## 0.68452632 0.5 0.7341544 0.459691193
## 0.68452632 0.6 0.7411949 0.475067643
## 0.68452632 0.7 0.7497917 0.493041015
## 0.68452632 0.8 0.7443290 0.482781431
## 0.68452632 0.9 0.7490165 0.492735883
## 0.68452632 1.0 0.7521415 0.499589210
## 0.68452632 1.1 0.7544914 0.504818546
## 0.68452632 1.2 0.7552665 0.506388989
## 0.68452632 1.3 0.7544822 0.504765681
## 0.68452632 1.4 0.7482200 0.491880344
## 0.68452632 1.5 0.7443015 0.483754018
## 0.68452632 1.6 0.7427359 0.480569967
## 0.68452632 1.7 0.7411765 0.477464004
## 0.68452632 1.8 0.7435202 0.481837112
## 0.68452632 1.9 0.7443015 0.483338461
## 0.68452632 2.0 0.7435202 0.481706394
## 0.73710526 0.1 0.5386887 0.014471254
## 0.73710526 0.2 0.6090717 0.182625171
## 0.73710526 0.3 0.6528493 0.282230457
## 0.73710526 0.4 0.7067739 0.399371887
## 0.73710526 0.5 0.7239828 0.437086785
## 0.73710526 0.6 0.7325919 0.456282871
## 0.73710526 0.7 0.7396293 0.471051944
## 0.73710526 0.8 0.7427574 0.478422416
## 0.73710526 0.9 0.7482230 0.490319617
## 0.73710526 1.0 0.7482322 0.491139733
## 0.73710526 1.1 0.7513572 0.497817978
## 0.73710526 1.2 0.7536949 0.502669195
## 0.73710526 1.3 0.7497855 0.494633491
## 0.73710526 1.4 0.7458670 0.486686226
## 0.73710526 1.5 0.7466483 0.488097289
## 0.73710526 1.6 0.7450827 0.484599139
## 0.73710526 1.7 0.7442953 0.482811557
## 0.73710526 1.8 0.7442984 0.482745881
## 0.73710526 1.9 0.7442984 0.482745881
## 0.73710526 2.0 0.7442984 0.482734321
## 0.78968421 0.1 0.5308824 -0.004244911
## 0.78968421 0.2 0.5996752 0.158699273
## 0.78968421 0.3 0.6403431 0.253696471
## 0.78968421 0.4 0.6903676 0.363056932
## 0.78968421 0.5 0.7106832 0.407823576
## 0.78968421 0.6 0.7247610 0.438572436
## 0.78968421 0.7 0.7302451 0.451144434
## 0.78968421 0.8 0.7396293 0.470955772
## 0.78968421 0.9 0.7411857 0.475147318
## 0.78968421 1.0 0.7450980 0.483895588
## 0.78968421 1.1 0.7450980 0.484521036
## 0.78968421 1.2 0.7474418 0.489288882
## 0.78968421 1.3 0.7443045 0.482649585
## 0.78968421 1.4 0.7443015 0.482529251
## 0.78968421 1.5 0.7443015 0.482402359
## 0.78968421 1.6 0.7435141 0.480640441
## 0.78968421 1.7 0.7427420 0.478963850
## 0.78968421 1.8 0.7427420 0.478963850
## 0.78968421 1.9 0.7435233 0.480587078
## 0.78968421 2.0 0.7458762 0.485114736
## 0.84226316 0.1 0.5324479 -0.001561577
## 0.84226316 0.2 0.5840319 0.121014326
## 0.84226316 0.3 0.6301746 0.229494949
## 0.84226316 0.4 0.6739553 0.327189751
## 0.84226316 0.5 0.7052206 0.395074715
## 0.84226316 0.6 0.7138113 0.414363154
## 0.84226316 0.7 0.7271078 0.443410111
## 0.84226316 0.8 0.7318107 0.454256381
## 0.84226316 0.9 0.7349357 0.461327989
## 0.84226316 1.0 0.7404075 0.473880555
## 0.84226316 1.1 0.7427482 0.479130440
## 0.84226316 1.2 0.7411857 0.475878397
## 0.84226316 1.3 0.7419608 0.477320494
## 0.84226316 1.4 0.7403952 0.474354077
## 0.84226316 1.5 0.7403952 0.473790641
## 0.84226316 1.6 0.7403983 0.473704801
## 0.84226316 1.7 0.7411826 0.475280291
## 0.84226316 1.8 0.7419669 0.476820395
## 0.84226316 1.9 0.7435294 0.480000100
## 0.84226316 2.0 0.7427451 0.478459996
## 0.89484211 0.1 0.5332292 0.000000000
## 0.89484211 0.2 0.5707353 0.089766937
## 0.89484211 0.3 0.6176685 0.200842245
## 0.89484211 0.4 0.6583027 0.292390751
## 0.89484211 0.5 0.6974081 0.377664361
## 0.89484211 0.6 0.7122518 0.410438112
## 0.89484211 0.7 0.7192862 0.426190690
## 0.89484211 0.8 0.7263297 0.441901698
## 0.89484211 0.9 0.7302420 0.450955438
## 0.89484211 1.0 0.7325858 0.456792902
## 0.89484211 1.1 0.7364951 0.465496219
## 0.89484211 1.2 0.7388358 0.470452697
## 0.89484211 1.3 0.7411795 0.475248160
## 0.89484211 1.4 0.7380515 0.468572894
## 0.89484211 1.5 0.7372733 0.466955506
## 0.89484211 1.6 0.7364920 0.465185778
## 0.89484211 1.7 0.7372733 0.466947617
## 0.89484211 1.8 0.7396170 0.471775246
## 0.89484211 1.9 0.7388358 0.470140241
## 0.89484211 2.0 0.7380545 0.468372706
## 0.94742105 0.1 0.5332292 0.000000000
## 0.94742105 0.2 0.5629167 0.070506484
## 0.94742105 0.3 0.6106219 0.183107274
## 0.94742105 0.4 0.6520496 0.277500553
## 0.94742105 0.5 0.6856832 0.351875369
## 0.94742105 0.6 0.7083425 0.401162247
## 0.94742105 0.7 0.7145956 0.415788395
## 0.94742105 0.8 0.7231985 0.434437475
## 0.94742105 0.9 0.7294577 0.448185834
## 0.94742105 1.0 0.7310263 0.453088492
## 0.94742105 1.1 0.7325888 0.457060901
## 0.94742105 1.2 0.7341483 0.460194362
## 0.94742105 1.3 0.7364920 0.464752919
## 0.94742105 1.4 0.7341483 0.459838960
## 0.94742105 1.5 0.7333670 0.458179704
## 0.94742105 1.6 0.7325797 0.456719832
## 0.94742105 1.7 0.7325766 0.456591473
## 0.94742105 1.8 0.7325797 0.456618774
## 0.94742105 1.9 0.7333609 0.458137104
## 0.94742105 2.0 0.7325766 0.456467002
## 1.00000000 0.1 0.5332292 0.000000000
## 1.00000000 0.2 0.5527512 0.047080576
## 1.00000000 0.3 0.5957690 0.148931489
## 1.00000000 0.4 0.6426838 0.256016554
## 1.00000000 0.5 0.6778585 0.334263229
## 1.00000000 0.6 0.6973989 0.377355644
## 1.00000000 0.7 0.7083456 0.401902057
## 1.00000000 0.8 0.7192892 0.425767768
## 1.00000000 0.9 0.7263266 0.441072578
## 1.00000000 1.0 0.7278952 0.445726709
## 1.00000000 1.1 0.7286765 0.448462723
## 1.00000000 1.2 0.7333701 0.458123886
## 1.00000000 1.3 0.7341544 0.459543219
## 1.00000000 1.4 0.7286765 0.448213323
## 1.00000000 1.5 0.7278922 0.446661319
## 1.00000000 1.6 0.7286703 0.448183662
## 1.00000000 1.7 0.7294516 0.449701921
## 1.00000000 1.8 0.7302359 0.451372293
## 1.00000000 1.9 0.7286703 0.448055121
## 1.00000000 2.0 0.7278891 0.446408681
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.2113158 and C = 0.3.
plot(caret.svm.radial)
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.svm.radial)
## sigma C Accuracy Kappa AccuracySD KappaSD
## 1 0.2113158 0.3 0.7709252 0.5421595 0.0123038 0.02308627
train_svm$y_pred_probs2 <- predict(caret.svm.radial, newdata = train_svm)
# train_svm$y_pred_probs2 <-
# ifelse(train_svm$y_pred_probs2$`1` > 0.5,
# train_svm$y_pred_probs2$`1`,
# 1-train_svm$y_pred_probs2$`0`)
# train_svm$y_pred2 <- ifelse(train_svm$y_pred_probs2 >
# 0.5, 1, 0)
# train_svm$y_pred_probs2 train_svm train_svm$y_pred2
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de SVM obtenido:
cm_train_svm_radial <- confusionMatrix(as.factor(train_svm$y_pred_probs2),
as.factor(train_svm$nota_vino), positive = "1")
cm_train_svm_radial$table
## Reference
## Prediction 0 1
## 0 498 157
## 1 99 525
# result
cm_train_svm_radial$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.8
cm_train_svm_radial$byClass["Recall"] %>%
round(2)
## Recall
## 0.77
cm_train_svm_radial$byClass["Precision"] %>%
round(2)
## Precision
## 0.84
Reproducimos la curva ROC sobre el modelo final de SVM obtenido:
roc_svm_radial <- plot.roc(as.numeric(train_svm$nota_vino), as.numeric(train_svm$y_pred_probs2),
col = "aquamarine4")
auc(roc_svm_radial)
## Area under the curve: 0.802
Se obtiene alrededor de un 80/81% de área bajo la curva.
train_km <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_km
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_km$nota_vino)
##
## 0 1
## 597 682
str(train_km)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
Quitamos la variable respuesta “quality” y nos quedamos solo con las 4 variables más representativas analizadas previamente (“alcohol”, “volatile_acidity”, “Log_sulphates”, “Log_total_sulfur_dioxide”:
train_kmeans <- train[, c(-1, -3,-4,-6,-7,-8,-9,-10)]
train_kmeans
## # A tibble: 1,279 × 4
## volatile_acidity alcohol Log_total_sulfur_dioxide Log_sulphates
## <dbl> <dbl> <dbl> <dbl>
## 1 0.48 10.3 2.77 -0.635
## 2 0.49 9 4.44 -0.545
## 3 1.02 10.5 4.44 -0.478
## 4 0.43 9.5 4.19 -0.446
## 5 0.59 9.7 3.97 -0.400
## 6 0.815 9.8 3.37 -0.673
## 7 0.21 10.4 3.14 -0.400
## 8 0.36 11 4.28 -0.386
## 9 0.49 14 4.47 -0.198
## 10 0.49 11.3 3.18 -0.186
## # … with 1,269 more rows
#alcohol + volatile_acidity + Log_sulphates + Log_total_sulfur_dioxide - nos quedamos solo con las representativas
Realizamos un escalado y estandarización de las variables para que no haya diferencia en las magnitudes:
train_kmeans_s<-scale(train_kmeans, center = TRUE, scale = TRUE)
#train_kmeans_s
Tratamos de buscar el valor óptimo de cluster a tener en nuestro modelo, aunque partimos de la premisa de tratar de buscar 2 cluster en función a si el vino aprueba o suspende:
fviz_nbclust(train_kmeans_s, kmeans, method = "wss")
No vemos un codo muy definido, y el k optimo podría estar entre 3 y 7 clusters.
Otra forma de buscar el óptimo:
wholesaleBest = FitKMeans(train_kmeans_s, max.clusters = 10, nstart = 25, seed = 666)
wholesaleBest
## Clusters Hartigan AddCluster
## 1 2 445.04986 TRUE
## 2 3 225.70323 TRUE
## 3 4 223.17070 TRUE
## 4 5 141.34148 TRUE
## 5 6 130.06708 TRUE
## 6 7 97.50276 TRUE
## 7 8 95.23648 TRUE
## 8 9 90.17194 TRUE
## 9 10 70.59436 TRUE
PlotHartigan(wholesaleBest)
o vemos un codo muy definido, y el k optimo podría estar entre 3 y 7 clusters.
#calculate gap statistic based on number of clusters
gap_stat <- clusGap(train_kmeans_s,
FUN = kmeans,
nstart = 25,
K.max = 10,
B = 50)
#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)
Desarrollamos el Clustering con K-means con el número óptimo de centroides “K” = 2:
#make this example reproducible
set.seed(666)
km2 <- kmeans(train_kmeans_s, centers = 2, nstart = 25)
#view results
km2
## K-means clustering with 2 clusters of sizes 718, 561
##
## Cluster means:
## volatile_acidity alcohol Log_total_sulfur_dioxide Log_sulphates
## 1 0.4969353 -0.5388834 0.2352972 -0.4630168
## 2 -0.6360064 0.6896939 -0.3011469 0.5925955
##
## Clustering vector:
## [1] 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 1 2 2 1 1 1
## [38] 1 2 1 2 1 1 1 2 2 2 1 1 1 1 1 2 1 1 2 2 1 1 2 2 2 1 2 2 1 1 2 1 2 2 2 1 1
## [75] 1 2 1 2 2 1 1 1 2 2 1 1 1 1 1 2 1 2 2 2 1 1 1 2 2 2 1 2 1 1 1 1 2 1 1 2 2
## [112] 1 1 1 1 1 2 1 1 1 1 2 1 1 2 2 2 1 2 2 2 1 2 2 1 2 1 2 1 2 1 2 2 1 2 2 1 1
## [149] 2 2 1 1 2 2 1 1 2 2 2 2 2 1 1 1 2 1 1 2 1 1 1 2 2 2 1 2 1 1 2 1 1 1 1 1 2
## [186] 2 2 1 2 1 1 2 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 1 1 1 2 1 2 2 2 2 2 1 2 2 2
## [223] 1 2 2 1 1 1 2 2 2 2 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 1 2
## [260] 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 2 2 1 2 1 2 2 1 2 1 1 1 1 1 1 2 2 1 2 1 2
## [297] 2 2 1 1 2 1 1 1 2 2 2 2 1 2 1 2 1 1 2 2 2 1 2 1 1 2 1 2 1 1 1 2 1 2 2 1 1
## [334] 1 1 2 2 2 2 1 1 1 2 1 2 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 2 2 2 1 2 1 1 1 2 1
## [371] 1 1 2 2 2 1 1 2 2 1 1 1 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 1 1 1 2 2 2 1 2
## [408] 2 2 1 1 2 1 1 1 2 1 1 2 1 1 2 2 1 2 1 1 1 1 1 2 2 1 2 2 1 1 1 1 1 1 2 2 2
## [445] 2 2 2 1 1 2 2 2 1 2 2 2 1 1 2 1 1 2 1 1 2 2 2 2 1 1 2 1 2 1 2 2 1 2 1 2 1
## [482] 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 2 2 1 1 1 1 2 1 2 2 1 1 1 2
## [519] 1 2 1 2 1 1 1 1 2 1 2 1 2 1 1 2 2 2 2 1 1 2 1 2 2 1 2 2 2 2 1 1 1 1 1 2 1
## [556] 1 1 2 1 1 1 1 2 1 1 2 2 2 1 2 1 2 1 1 1 1 1 2 1 2 2 1 2 1 1 1 1 1 2 1 1 2
## [593] 1 1 1 2 1 1 1 2 2 1 2 2 1 1 2 1 2 1 1 1 2 1 1 1 1 2 2 1 2 2 1 1 2 1 1 1 2
## [630] 2 2 1 2 2 2 2 1 2 1 2 2 1 1 2 1 2 2 1 2 1 1 2 2 2 1 2 1 2 1 1 2 2 2 2 1 2
## [667] 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1 2 1 1 2 2 1 2 1 2 1 2 2 1 2 1 1
## [704] 2 1 1 2 1 1 2 1 1 1 2 2 1 2 1 2 1 2 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [741] 2 1 2 1 2 2 1 1 1 2 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 1 2 1 1 2 1 2 1 1 1 2 1
## [778] 2 1 1 2 2 2 1 1 2 1 2 2 2 1 1 1 2 2 1 2 2 1 1 2 2 2 2 1 2 2 1 1 1 1 2 1 1
## [815] 1 1 2 1 1 1 1 1 1 1 2 1 1 2 2 1 2 1 2 1 1 1 2 1 1 1 2 1 1 2 1 1 1 1 1 2 2
## [852] 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 2 1 2 2 1 1 2 2 1 2 1 1 2 2 2 1
## [889] 1 1 2 1 2 1 2 1 1 2 1 2 1 1 1 1 2 1 2 1 2 1 2 1 1 2 2 1 1 1 2 1 2 2 1 1 2
## [926] 2 2 2 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 2 1 1 1 1 2 1 2
## [963] 1 1 1 1 2 2 2 1 2 1 1 2 1 2 1 1 1 2 1 1 2 1 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1
## [1000] 2 1 2 1 1 1 1 2 1 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 2 1 1 2 1 1 2 1 1 1 1
## [1037] 2 2 1 1 1 1 1 2 1 2 2 1 1 2 2 1 1 2 2 1 1 2 1 1 2 1 1 1 2 1 1 1 2 2 2 2 2
## [1074] 1 1 2 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 2 1 2 1 2 2 1 2 1 2 1 2 2 2
## [1111] 1 1 1 2 1 1 1 2 2 2 2 2 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [1148] 1 2 2 1 1 1 2 1 2 1 1 2 1 2 2 1 1 1 2 2 1 1 2 1 1 1 1 2 2 1 1 2 2 1 2 1 1
## [1185] 1 1 1 2 2 2 1 2 2 1 1 2 1 1 1 1 2 2 2 2 2 2 1 2 1 2 1 2 1 1 2 1 1 1 2 2 2
## [1222] 1 1 1 2 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 1 2 2 1 1 1 2 1 2 2 2 1 1 2 2 2 2
## [1259] 2 2 1 1 2 1 1 1 2 2 2 2 2 2 1 2 1 2 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 1857.765 1933.080
## (between_SS / total_SS = 25.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
km2$size
## [1] 718 561
km2$centers
## volatile_acidity alcohol Log_total_sulfur_dioxide Log_sulphates
## 1 0.4969353 -0.5388834 0.2352972 -0.4630168
## 2 -0.6360064 0.6896939 -0.3011469 0.5925955
Graficamos los resultados obtenidos:
fviz_cluster(km2, data = train_kmeans_s, geom = "point", show.clust.cent=TRUE)
Vemos la media de los valores para cada uno de los diferentes clusters y su tamaño:
aggregate(train_kmeans_s, by=list(cluster=km2$cluster), mean)
## cluster volatile_acidity alcohol Log_total_sulfur_dioxide Log_sulphates
## 1 1 0.4969353 -0.5388834 0.2352972 -0.4630168
## 2 2 -0.6360064 0.6896939 -0.3011469 0.5925955
km2$size
## [1] 718 561
Vemos como se reparten los cluster en los diferentes grupos según la variable “nota_vino”
table(km2$cluster, train_km$nota_vino)
##
## 0 1
## 1 479 239
## 2 118 443
Obtenemos un accuracy entorno al 70% realizando un cluster con 2 centroides.
Vemos como se reparten los cluster en los diferentes grupos según la variable “quality”
table(km2$cluster, train$quality)
##
## 3 4 5 6 7 8
## 1 7 31 441 224 15 0
## 2 0 7 111 289 141 13
Desarrollamos el Clustering con K-means con el número óptimo de centroides “K” = 6:
#make this example reproducible
set.seed(666)
km6 <- kmeans(train_kmeans_s, centers = 6, nstart = 25)
#view results
km6
## K-means clustering with 6 clusters of sizes 195, 111, 165, 313, 171, 324
##
## Cluster means:
## volatile_acidity alcohol Log_total_sulfur_dioxide Log_sulphates
## 1 -0.7519441 0.9335113 -1.1993829 0.08678682
## 2 1.8977342 0.1497254 -0.4409611 -0.77013515
## 3 -0.6560209 -0.2397655 0.1261707 1.69884929
## 4 0.2333454 -0.7013162 1.0876603 -0.39408940
## 5 -0.4034328 1.4622998 0.4044269 0.26648856
## 6 0.1239937 -0.5852908 -0.4555139 -0.41348186
##
## Clustering vector:
## [1] 6 4 2 4 4 2 1 5 5 3 5 3 6 5 3 4 1 5 5 1 3 1 3 6 1 1 6 1 6 1 2 6 5 1 6 2 6
## [38] 4 1 6 5 4 6 6 3 1 1 4 6 4 4 4 1 6 6 1 5 4 6 3 3 1 6 6 6 6 2 3 6 1 3 1 4 6
## [75] 4 5 2 5 5 4 6 4 3 4 4 4 6 4 6 3 4 5 1 1 4 6 2 3 3 1 4 1 6 6 2 6 3 6 4 1 1
## [112] 4 6 4 6 6 5 4 4 4 4 5 4 6 1 3 5 4 1 3 1 6 3 1 4 5 4 5 2 1 4 6 5 6 3 6 6 4
## [149] 5 1 6 6 3 1 2 6 3 1 5 1 1 2 6 6 1 4 6 1 2 4 4 5 3 1 6 3 4 2 1 4 6 4 6 6 3
## [186] 3 6 6 3 6 4 1 6 5 4 6 3 3 6 1 3 5 4 5 6 1 4 1 6 4 4 3 6 6 5 5 3 1 6 1 3 3
## [223] 4 4 3 6 4 3 3 3 5 1 4 4 6 4 6 1 1 4 6 5 2 5 2 6 4 3 1 4 5 1 1 4 6 1 6 2 1
## [260] 6 6 2 1 4 6 6 6 5 6 1 4 6 6 2 1 5 3 6 3 2 3 1 5 1 6 6 2 6 6 6 1 3 6 1 2 1
## [297] 5 5 4 4 1 6 6 6 1 5 1 3 4 5 6 5 4 6 3 1 3 6 1 4 2 3 6 1 6 4 6 5 4 3 1 6 4
## [334] 6 6 1 6 6 3 4 6 6 1 4 1 5 4 2 3 4 4 4 5 6 1 6 4 4 4 4 5 1 1 2 1 4 6 4 3 4
## [371] 6 4 1 3 5 4 2 3 5 6 6 6 6 3 1 6 3 4 6 5 1 1 6 1 4 6 4 3 6 4 4 6 1 1 1 4 1
## [408] 1 5 4 6 3 4 4 6 6 6 6 5 6 4 1 6 4 1 6 6 6 6 6 3 3 4 1 5 6 4 2 4 4 6 3 1 5
## [445] 5 1 1 6 2 3 3 3 4 5 5 6 6 4 3 4 6 3 4 5 1 3 5 1 4 2 5 4 3 5 1 3 6 5 6 3 6
## [482] 4 4 1 2 2 1 1 6 6 2 1 6 6 6 4 6 4 2 5 2 1 3 2 3 6 6 6 4 6 3 6 3 1 6 6 4 1
## [519] 6 1 6 5 4 2 6 5 1 4 3 6 5 2 6 5 1 5 1 6 6 3 4 1 1 2 6 5 2 5 2 2 2 6 2 1 2
## [556] 4 4 3 6 2 2 6 3 2 6 5 3 5 4 5 5 5 4 6 2 6 5 3 2 1 3 4 5 4 4 4 2 4 5 6 6 3
## [593] 2 6 4 1 4 4 4 4 1 4 5 3 6 6 5 3 5 6 6 6 5 6 4 6 6 1 5 6 3 5 4 6 3 4 6 2 3
## [630] 5 3 4 3 1 3 5 4 3 4 3 3 4 6 3 4 1 3 3 5 6 4 5 4 3 6 3 2 5 4 6 1 5 3 5 6 6
## [667] 6 4 6 2 5 4 6 6 6 4 1 6 6 1 4 6 3 4 6 5 4 1 4 4 3 5 2 5 6 3 4 5 1 4 4 4 4
## [704] 4 4 4 1 2 4 5 4 6 6 5 3 2 3 6 1 6 1 3 4 3 1 4 4 4 4 4 6 4 2 6 4 4 4 6 5 4
## [741] 1 4 5 4 6 5 4 2 4 1 2 5 4 4 6 4 6 2 1 1 4 6 5 3 2 6 5 2 4 1 4 1 3 2 4 3 2
## [778] 5 6 6 5 1 3 2 6 3 6 4 3 1 4 6 4 1 5 6 3 1 6 4 3 5 5 5 4 1 3 4 2 6 6 3 2 4
## [815] 2 6 5 6 4 6 6 4 2 6 5 6 6 4 1 3 5 5 3 2 4 4 5 4 2 4 6 4 4 1 4 6 6 2 4 1 1
## [852] 4 2 1 3 6 2 3 2 6 4 4 4 4 2 4 3 2 5 5 2 4 1 4 1 5 4 2 1 1 6 6 4 4 2 1 3 4
## [889] 4 4 5 6 5 6 5 4 6 3 4 5 4 4 6 6 3 6 1 4 3 4 3 6 4 5 3 4 4 6 1 2 3 5 4 4 5
## [926] 3 1 1 6 6 4 6 4 5 6 4 4 1 5 6 4 6 4 5 6 3 6 4 1 3 6 4 6 4 1 4 6 6 5 6 2 1
## [963] 4 4 4 6 1 3 1 6 1 4 2 6 4 5 4 3 4 3 2 4 1 4 1 6 6 5 5 1 6 1 3 4 4 3 6 5 6
## [1000] 3 6 5 4 4 4 2 1 4 1 1 6 4 4 1 5 4 6 4 3 4 6 6 2 6 6 5 4 6 1 6 6 3 6 4 6 4
## [1037] 5 3 4 6 2 4 6 3 4 5 3 4 6 1 1 6 4 3 1 4 2 5 6 6 1 2 6 6 1 6 6 4 5 5 5 3 5
## [1074] 6 4 3 2 5 4 5 1 4 6 5 4 4 4 3 6 2 6 2 6 5 3 2 3 4 3 6 5 1 6 5 6 5 4 5 5 6
## [1111] 2 6 6 1 4 4 2 5 2 1 6 1 2 5 6 2 4 6 6 3 4 1 4 4 6 4 3 4 2 2 4 4 6 3 4 6 4
## [1148] 4 6 1 6 4 6 5 6 1 6 4 1 4 1 3 6 2 4 5 1 4 4 6 4 2 2 6 1 1 6 4 4 1 4 5 6 6
## [1185] 6 4 6 3 1 6 2 3 1 6 6 1 4 6 4 4 1 5 3 3 3 3 6 5 4 1 6 5 4 4 5 4 4 4 5 5 1
## [1222] 2 4 2 5 2 1 1 3 4 6 4 3 2 5 6 3 2 6 3 6 6 5 5 6 4 6 5 2 1 5 1 4 4 6 1 1 1
## [1259] 3 5 6 4 3 6 4 4 6 5 3 5 5 1 6 5 4 5 4 2 3
##
## Within cluster sum of squares by cluster:
## [1] 345.2699 285.3923 409.7258 412.8197 366.0382 419.5249
## (between_SS / total_SS = 56.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
km6$size
## [1] 195 111 165 313 171 324
km6$centers
## volatile_acidity alcohol Log_total_sulfur_dioxide Log_sulphates
## 1 -0.7519441 0.9335113 -1.1993829 0.08678682
## 2 1.8977342 0.1497254 -0.4409611 -0.77013515
## 3 -0.6560209 -0.2397655 0.1261707 1.69884929
## 4 0.2333454 -0.7013162 1.0876603 -0.39408940
## 5 -0.4034328 1.4622998 0.4044269 0.26648856
## 6 0.1239937 -0.5852908 -0.4555139 -0.41348186
Graficamos los resultados obtenidos:
fviz_cluster(km6, data = train_kmeans_s, geom = "point", show.clust.cent=TRUE)
Vemos la media de los valores para cada uno de los diferentes clusters y su tamaño:
aggregate(train_kmeans_s, by=list(cluster=km6$cluster), mean)
## cluster volatile_acidity alcohol Log_total_sulfur_dioxide Log_sulphates
## 1 1 -0.7519441 0.9335113 -1.1993829 0.08678682
## 2 2 1.8977342 0.1497254 -0.4409611 -0.77013515
## 3 3 -0.6560209 -0.2397655 0.1261707 1.69884929
## 4 4 0.2333454 -0.7013162 1.0876603 -0.39408940
## 5 5 -0.4034328 1.4622998 0.4044269 0.26648856
## 6 6 0.1239937 -0.5852908 -0.4555139 -0.41348186
km6$size
## [1] 195 111 165 313 171 324
Vemos como se reparten los cluster en los diferentes grupos según la variable “quality”
table(km6$cluster, train$quality)
##
## 3 4 5 6 7 8
## 1 0 4 30 96 59 6
## 2 5 16 47 42 1 0
## 3 0 0 45 84 35 1
## 4 1 5 228 77 2 0
## 5 0 3 20 94 48 6
## 6 1 10 182 120 11 0
gam_mod <- gam(quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) + s(citric_acid) +
s(fixed_acidity), data = train, method = "REML")
gam_mod
##
## Family: gaussian
## Link function: identity
##
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
## s(citric_acid) + s(fixed_acidity)
##
## Estimated degrees of freedom:
## 3.80 1.00 3.69 1.00 2.25 2.94 1.99
## 2.55 total = 20.23
##
## REML score: 1221.011
coef(gam_mod)
## (Intercept) s(alcohol).1
## 5.634871e+00 1.648020e-01
## s(alcohol).2 s(alcohol).3
## -1.165238e-01 7.041699e-02
## s(alcohol).4 s(alcohol).5
## -9.552495e-02 -4.469214e-02
## s(alcohol).6 s(alcohol).7
## 1.131493e-01 -4.810090e-02
## s(alcohol).8 s(alcohol).9
## 6.033374e-01 8.274343e-02
## s(volatile_acidity).1 s(volatile_acidity).2
## -4.570556e-05 1.038005e-06
## s(volatile_acidity).3 s(volatile_acidity).4
## -1.351067e-05 -1.793403e-05
## s(volatile_acidity).5 s(volatile_acidity).6
## 1.458357e-05 1.751913e-05
## s(volatile_acidity).7 s(volatile_acidity).8
## -1.271810e-05 1.597256e-04
## s(volatile_acidity).9 s(Log_sulphates).1
## -1.701127e-01 -7.829302e-03
## s(Log_sulphates).2 s(Log_sulphates).3
## -7.099365e-02 4.537419e-02
## s(Log_sulphates).4 s(Log_sulphates).5
## -3.268186e-02 -4.713404e-02
## s(Log_sulphates).6 s(Log_sulphates).7
## 1.907383e-02 1.687056e-02
## s(Log_sulphates).8 s(Log_sulphates).9
## 1.915352e-01 1.288852e-01
## s(Log_chlorides).1 s(Log_chlorides).2
## 8.907392e-06 -3.929104e-07
## s(Log_chlorides).3 s(Log_chlorides).4
## -4.775713e-06 -2.441926e-06
## s(Log_chlorides).5 s(Log_chlorides).6
## 2.867426e-06 5.221761e-07
## s(Log_chlorides).7 s(Log_chlorides).8
## -2.957613e-06 -1.322006e-05
## s(Log_chlorides).9 s(pH).1
## -6.951358e-02 -9.276906e-03
## s(pH).2 s(pH).3
## 1.043393e-02 -9.687645e-03
## s(pH).4 s(pH).5
## -1.755619e-02 2.316811e-03
## s(pH).6 s(pH).7
## 1.786883e-02 9.528146e-03
## s(pH).8 s(pH).9
## 1.380056e-01 -9.893089e-02
## s(Log_total_sulfur_dioxide).1 s(Log_total_sulfur_dioxide).2
## 1.264371e-01 -7.861424e-03
## s(Log_total_sulfur_dioxide).3 s(Log_total_sulfur_dioxide).4
## 3.959153e-02 -2.883118e-02
## s(Log_total_sulfur_dioxide).5 s(Log_total_sulfur_dioxide).6
## -3.644112e-02 3.414453e-02
## s(Log_total_sulfur_dioxide).7 s(Log_total_sulfur_dioxide).8
## 2.681938e-02 1.658594e-01
## s(Log_total_sulfur_dioxide).9 s(citric_acid).1
## 4.615124e-02 -8.294107e-03
## s(citric_acid).2 s(citric_acid).3
## 2.156461e-02 -8.422364e-03
## s(citric_acid).4 s(citric_acid).5
## 1.810441e-02 7.342591e-03
## s(citric_acid).6 s(citric_acid).7
## -2.160911e-02 7.256404e-03
## s(citric_acid).8 s(citric_acid).9
## -1.070232e-01 -3.440155e-02
## s(fixed_acidity).1 s(fixed_acidity).2
## -3.380473e-03 -4.880922e-02
## s(fixed_acidity).3 s(fixed_acidity).4
## -2.163905e-02 -4.395839e-02
## s(fixed_acidity).5 s(fixed_acidity).6
## 1.957378e-02 -4.981481e-02
## s(fixed_acidity).7 s(fixed_acidity).8
## -2.147517e-02 2.926902e-01
## s(fixed_acidity).9
## -3.132917e-02
summary(gam_mod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
## s(citric_acid) + s(fixed_acidity)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.63487 0.01713 328.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(alcohol) 3.799 4.762 54.064 < 2e-16 ***
## s(volatile_acidity) 1.003 1.006 54.789 < 2e-16 ***
## s(Log_sulphates) 3.691 4.634 23.398 < 2e-16 ***
## s(Log_chlorides) 1.000 1.001 11.922 0.000572 ***
## s(pH) 2.252 2.906 5.016 0.001882 **
## s(Log_total_sulfur_dioxide) 2.943 3.724 3.550 0.008809 **
## s(citric_acid) 1.991 2.508 2.440 0.062463 .
## s(fixed_acidity) 2.553 3.260 1.909 0.134935
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.404 Deviance explained = 41.3%
## -REML = 1221 Scale est. = 0.37532 n = 1279
Falta la variable log clorides——-
gam_mod2 = gam(quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
s(Log_chlorides), data = train, method = "REML")
summary(gam_mod2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6349 0.0174 323.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(alcohol) 4.054 5.062 52.020 <2e-16 ***
## s(volatile_acidity) 1.001 1.003 79.945 <2e-16 ***
## s(Log_sulphates) 3.730 4.669 23.833 <2e-16 ***
## s(Log_chlorides) 4.316 5.382 2.686 0.0169 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.385 Deviance explained = 39.1%
## -REML = 1230.7 Scale est. = 0.38729 n = 1279
plot(gam_mod2, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightgreen")
gam.check(gam_mod2)
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-9.689108e-05,0.0002921235]
## (score 1230.694 & scale 0.3872926).
## Hessian positive definite, eigenvalue range [9.662209e-05,637.0106].
## Model rank = 37 / 37
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(alcohol) 9.00 4.05 1.03 0.82
## s(volatile_acidity) 9.00 1.00 0.96 0.09 .
## s(Log_sulphates) 9.00 3.73 0.97 0.14
## s(Log_chlorides) 9.00 4.32 1.01 0.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lo primero de todo, creamos la variable binaria “nota_vino”, para que en función de “quality” nos diga los vinos con calificaciones aprobadas (quality >= 6, anotados con un “1”) o suspensas (quality < 6, anotados con un “0”)
# https://noamross.github.io/gams-in-r-course/chapter4
train_gam <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_gam
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_gam$nota_vino)
##
## 0 1
## 597 682
Para aplicar GAM logstico a nuestro problema, utilizamos el paquete mgcv y la familia=binomial que indica a la función GAM que nuestra variable respuesta será 0 o 1, es decir, vino bueno o vino malo. Las variables están envueltas por la función s, que es una función que espeficia que queremos que la relación sea flexible.
gam_mod_log = gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
s(Log_sulphates) + s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
s(citric_acid) + s(fixed_acidity), data = train_gam, method = "REML",
family = binomial)
summary(gam_mod_log)
##
## Family: binomial
## Link function: logit
##
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
## s(citric_acid) + s(fixed_acidity)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.30934 0.07799 3.966 7.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alcohol) 5.846 6.974 143.805 < 2e-16 ***
## s(volatile_acidity) 1.855 2.377 27.641 3.25e-06 ***
## s(Log_sulphates) 2.484 3.169 61.194 < 2e-16 ***
## s(Log_chlorides) 6.015 7.183 18.378 0.0121 *
## s(pH) 2.416 3.119 6.121 0.1115
## s(Log_total_sulfur_dioxide) 3.741 4.706 28.623 2.87e-05 ***
## s(citric_acid) 1.490 1.828 6.633 0.0191 *
## s(fixed_acidity) 2.535 3.226 6.556 0.1200
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.371 Deviance explained = 32.2%
## -REML = 642.13 Scale est. = 1 n = 1279
Nos quedamos solo con las variables más significativas (tres ***).
gam_mod_log2 = gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
s(Log_sulphates) + s(Log_total_sulfur_dioxide), data = train_gam,
method = "REML", family = binomial)
summary(gam_mod_log2)
##
## Family: binomial
## Link function: logit
##
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_total_sulfur_dioxide)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26735 0.07471 3.579 0.000345 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alcohol) 5.927 7.058 150.80 < 2e-16 ***
## s(volatile_acidity) 1.049 1.095 34.16 < 2e-16 ***
## s(Log_sulphates) 4.078 5.044 54.78 < 2e-16 ***
## s(Log_total_sulfur_dioxide) 3.557 4.487 35.85 5.63e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.346 Deviance explained = 29.4%
## -REML = 649.89 Scale est. = 1 n = 1279
Aquí, el valor del intercepto es 0.26735 que esta en escala logaritmica. Podemos utilizar la función logística plogis() para convertirlo en una probabilidad.
plogis(0.26773)
## [1] 0.5665355
plogis(coef(gam_mod_log2)[1])
## (Intercept)
## 0.566441
Esto significa que el modelo predice una probabilidad base del 57% de un resultado positivo, es decir que un vino esta aprobado.
plot(gam_mod_log2, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue", trans = plogis, shift = coef(gam_mod_log2)[1],
seWithMean = TRUE, col = "purple")
#predict(gam_mod_log2, type="response", se.fit = TRUE)
#plogis(predict(gam_mod_log2, type="link"))
Calculamos la prediccion y los errores
predictions <- predict(gam_mod_log2, newdata = train_gam,
type = "link", se.fit = TRUE)
Explicando los predictores
head(predict(gam_mod_log2, type = "terms"))
## s(alcohol) s(volatile_acidity) s(Log_sulphates) s(Log_total_sulfur_dioxide)
## 1 -0.153936041 0.11651423 -0.54774301 0.15593321
## 2 -0.891143170 0.09074145 -0.24318835 -0.49397254
## 3 0.007793508 -1.25421557 -0.01908907 -0.49397254
## 4 -1.063371315 0.24564085 0.09246652 -0.13861017
## 5 -0.910318840 -0.16603197 0.25861731 0.09886004
## 6 -0.792699732 -0.73698784 -0.68971449 0.37393429
predict(gam_mod_log2, type = "terms")[1, ]
## s(alcohol) s(volatile_acidity)
## -0.1539360 0.1165142
## s(Log_sulphates) s(Log_total_sulfur_dioxide)
## -0.5477430 0.1559332
plogis(sum(predict(gam_mod_log2, type = "terms")) + coef(gam_mod_log2)[1])
## (Intercept)
## 0.566441
Los modelos que mejor encajan para la clasificación de los vino entre aprobados (quality =>6) y suspenso (quality <6) son los siguientes:
modelos <- list(KNN = caret.knn, GLM = caret.glm,
DT = caret.tree, RF = caret.rf,
ADA = caret.ada, XGB = caret.xgb)
resultados_resamples <- resamples(modelos)
#resultados_resamples$values
#colMeans(resultados_resamples$values[,-1])
resultados_modelos<-resultados_resamples$values[,-1]%>%
summarise_all(mean)
resultados_modelos
## KNN~Accuracy KNN~Kappa GLM~Accuracy GLM~Kappa DT~Accuracy DT~Kappa
## 1 0.7607445 0.5210021 0.7482917 0.4956541 0.7255101 0.4484141
## RF~Accuracy RF~Kappa ADA~Accuracy ADA~Kappa XGB~Accuracy XGB~Kappa
## 1 0.7646311 0.5276577 0.7607643 0.520692 0.7967279 0.5917263
resultados_modelos_acc<-resultados_modelos[,c(-2,-4,-6,-8,-10,-12)]
resultados_finales_acc<-t(resultados_modelos_acc)
colnames(resultados_finales_acc)<-'Accuracy'
resultados_finales_acc
## Accuracy
## KNN~Accuracy 0.7607445
## GLM~Accuracy 0.7482917
## DT~Accuracy 0.7255101
## RF~Accuracy 0.7646311
## ADA~Accuracy 0.7607643
## XGB~Accuracy 0.7967279
resultados_modelos_kap<-resultados_modelos[,c(-1,-3,-5,-7,-9,-11)]
resultados_finales_kap<-t(resultados_modelos_kap)
colnames(resultados_finales_kap)<-'Kappa'
resultados_finales_kap
## Kappa
## KNN~Kappa 0.5210021
## GLM~Kappa 0.4956541
## DT~Kappa 0.4484141
## RF~Kappa 0.5276577
## ADA~Kappa 0.5206920
## XGB~Kappa 0.5917263
GLM: Con un accuracy obtenido entorno al 75% y un AUC de aproximadamente un 82%.
caret.glm
## Generalized Linear Model
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1024, 1023, 1022, 1023
## Resampling results:
##
## Accuracy Kappa
## 0.7482917 0.4956541
auc(roc_glm)
## Area under the curve: 0.8195
KNN: Con un accuracy obtenido entorno al 76% y un AUC de aproximadamente un 83%.
caret.knn
## k-Nearest Neighbors
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1023, 1023, 1023, 1024
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.7529259 0.5024638
## 3 0.7271293 0.4496676
## 5 0.7451256 0.4857637
## 7 0.7279075 0.4514521
## 9 0.7372702 0.4712542
## 11 0.7497855 0.4961405
## 13 0.7474357 0.4921884
## 15 0.7443045 0.4863425
## 17 0.7419700 0.4817875
## 19 0.7403860 0.4788672
## 21 0.7403922 0.4788520
## 23 0.7482108 0.4949287
## 25 0.7443076 0.4870466
## 27 0.7435202 0.4857680
## 29 0.7404013 0.4793634
## 31 0.7443107 0.4873547
## 33 0.7443168 0.4876106
## 35 0.7466575 0.4922032
## 37 0.7474357 0.4937484
## 39 0.7474387 0.4936762
## 41 0.7435294 0.4852850
## 43 0.7419669 0.4824611
## 45 0.7396201 0.4774906
## 47 0.7458732 0.4903228
## 49 0.7435263 0.4853138
## 51 0.7419669 0.4826133
## 53 0.7505576 0.4996890
## 55 0.7505607 0.4998633
## 57 0.7497794 0.4980862
## 59 0.7536918 0.5055618
## 61 0.7560386 0.5102242
## 63 0.7560417 0.5104059
## 65 0.7583885 0.5153964
## 67 0.7544761 0.5079165
## 69 0.7599510 0.5183972
## 71 0.7599540 0.5181401
## 73 0.7560355 0.5106282
## 75 0.7536918 0.5056467
## 77 0.7576011 0.5138357
## 79 0.7560509 0.5110409
## 81 0.7544761 0.5080859
## 83 0.7552574 0.5095624
## 85 0.7552543 0.5096503
## 87 0.7529136 0.5050209
## 89 0.7576072 0.5145375
## 91 0.7599540 0.5192917
## 93 0.7552635 0.5098494
## 95 0.7552727 0.5096173
## 97 0.7591850 0.5178204
## 99 0.7607445 0.5210021
## 101 0.7583946 0.5161423
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 99.
auc(roc_knn)
## Area under the curve: 0.8268
Decision Tree: Con un accuracy obtenido entorno al 73% y un AUC de aproximadamente un 78%.
caret.tree
## CART
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1022, 1024, 1023, 1023, 1024
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00000000 0.7239386 0.4455300
## 0.01930706 0.7255101 0.4484141
## 0.03861412 0.7215885 0.4421371
## 0.05792118 0.7004855 0.4069132
## 0.07722825 0.7004855 0.4069132
## 0.09653531 0.7004855 0.4069132
## 0.11584237 0.7004855 0.4069132
## 0.13514943 0.7004855 0.4069132
## 0.15445649 0.7004855 0.4069132
## 0.17376355 0.7004855 0.4069132
## 0.19307062 0.7004855 0.4069132
## 0.21237768 0.7004855 0.4069132
## 0.23168474 0.7004855 0.4069132
## 0.25099180 0.7004855 0.4069132
## 0.27029886 0.7004855 0.4069132
## 0.28960592 0.7004855 0.4069132
## 0.30891299 0.7004855 0.4069132
## 0.32822005 0.7004855 0.4069132
## 0.34752711 0.7004855 0.4069132
## 0.36683417 0.6614230 0.3137864
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01930706.
auc(roc_tree)
## Area under the curve: 0.7798
Random Forest: Con un accuracy obtenido entorno al 76% y un AUC de aproximadamente un 92%.
caret.rf
## Random Forest
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1022, 1023, 1024, 1024
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7646311 0.5276577
## 3 0.7622874 0.5232098
## 4 0.7552530 0.5094700
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
auc(roc_rf)
## Area under the curve: 0.926
ADABoost: Con un accuracy obtenido entorno al 76% y un AUC de aproximadamente un 89%.
caret.ada
## Boosted Classification Trees
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1023, 1022, 1024, 1023
## Resampling results across tuning parameters:
##
## maxdepth iter Accuracy Kappa
## 1 50 0.7443213 0.4890732
## 1 100 0.7584084 0.5160080
## 1 150 0.7576149 0.5133790
## 2 50 0.7568276 0.5121128
## 2 100 0.7529335 0.5043760
## 2 150 0.7591927 0.5167245
## 3 50 0.7599647 0.5189156
## 3 100 0.7607643 0.5206920
## 3 150 0.7568488 0.5125098
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 100, maxdepth = 3 and nu = 0.1.
auc(roc_ada)
## Area under the curve: 0.8912
XGBoost: Con un accuracy obtenido entorno al 79% y un AUC de aproximadamente un 99%.
caret.xgb
## eXtreme Gradient Boosting
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1023, 1024, 1023, 1023
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy Kappa
## 0.3 1 0.6 0.50 50 0.7670343 0.5320296
## 0.3 1 0.6 0.50 100 0.7701562 0.5398387
## 0.3 1 0.6 0.50 150 0.7662316 0.5310145
## 0.3 1 0.6 0.75 50 0.7631250 0.5256632
## 0.3 1 0.6 0.75 100 0.7685784 0.5357699
## 0.3 1 0.6 0.75 150 0.7685846 0.5357358
## 0.3 1 0.6 1.00 50 0.7670251 0.5333802
## 0.3 1 0.6 1.00 100 0.7717096 0.5419941
## 0.3 1 0.6 1.00 150 0.7756281 0.5497087
## 0.3 1 0.8 0.50 50 0.7607690 0.5210389
## 0.3 1 0.8 0.50 100 0.7662377 0.5313294
## 0.3 1 0.8 0.50 150 0.7568536 0.5118797
## 0.3 1 0.8 0.75 50 0.7685937 0.5362024
## 0.3 1 0.8 0.75 100 0.7725031 0.5440919
## 0.3 1 0.8 0.75 150 0.7678002 0.5342390
## 0.3 1 0.8 1.00 50 0.7654596 0.5301518
## 0.3 1 0.8 1.00 100 0.7693627 0.5376469
## 0.3 1 0.8 1.00 150 0.7756250 0.5500934
## 0.3 2 0.6 0.50 50 0.7670374 0.5329864
## 0.3 2 0.6 0.50 100 0.7607598 0.5195670
## 0.3 2 0.6 0.50 150 0.7623407 0.5233040
## 0.3 2 0.6 0.75 50 0.7779596 0.5546808
## 0.3 2 0.6 0.75 100 0.7732813 0.5453115
## 0.3 2 0.6 0.75 150 0.7701624 0.5391406
## 0.3 2 0.6 1.00 50 0.7889216 0.5772912
## 0.3 2 0.6 1.00 100 0.7850031 0.5692846
## 0.3 2 0.6 1.00 150 0.7850092 0.5683808
## 0.3 2 0.8 0.50 50 0.7568689 0.5119235
## 0.3 2 0.8 0.50 100 0.7646814 0.5282494
## 0.3 2 0.8 0.50 150 0.7631342 0.5250606
## 0.3 2 0.8 0.75 50 0.7701471 0.5388302
## 0.3 2 0.8 0.75 100 0.7787439 0.5554401
## 0.3 2 0.8 0.75 150 0.7724847 0.5427775
## 0.3 2 0.8 1.00 50 0.7725000 0.5439889
## 0.3 2 0.8 1.00 100 0.7826593 0.5641783
## 0.3 2 0.8 1.00 150 0.7779718 0.5546279
## 0.3 3 0.6 0.50 50 0.7670037 0.5327791
## 0.3 3 0.6 0.50 100 0.7670129 0.5322871
## 0.3 3 0.6 0.50 150 0.7709283 0.5400068
## 0.3 3 0.6 0.75 50 0.7764154 0.5512849
## 0.3 3 0.6 0.75 100 0.7826777 0.5639447
## 0.3 3 0.6 0.75 150 0.7787684 0.5563594
## 0.3 3 0.6 1.00 50 0.7771906 0.5532351
## 0.3 3 0.6 1.00 100 0.7904657 0.5793479
## 0.3 3 0.6 1.00 150 0.7967279 0.5917263
## 0.3 3 0.8 0.50 50 0.7607812 0.5203291
## 0.3 3 0.8 0.50 100 0.7732813 0.5449223
## 0.3 3 0.8 0.50 150 0.7693750 0.5366763
## 0.3 3 0.8 0.75 50 0.7756373 0.5499677
## 0.3 3 0.8 0.75 100 0.7881403 0.5746385
## 0.3 3 0.8 0.75 150 0.7826654 0.5629558
## 0.3 3 0.8 1.00 50 0.7842279 0.5672008
## 0.3 3 0.8 1.00 100 0.7779841 0.5546403
## 0.3 3 0.8 1.00 150 0.7811029 0.5609415
## 0.4 1 0.6 0.50 50 0.7490411 0.4975463
## 0.4 1 0.6 0.50 100 0.7599755 0.5186439
## 0.4 1 0.6 0.50 150 0.7498100 0.4974482
## 0.4 1 0.6 0.75 50 0.7685815 0.5363475
## 0.4 1 0.6 0.75 100 0.7779687 0.5550400
## 0.4 1 0.6 0.75 150 0.7576195 0.5138389
## 0.4 1 0.6 1.00 50 0.7724969 0.5443752
## 0.4 1 0.6 1.00 100 0.7717096 0.5422339
## 0.4 1 0.6 1.00 150 0.7756250 0.5497269
## 0.4 1 0.8 0.50 50 0.7505974 0.4992091
## 0.4 1 0.8 0.50 100 0.7662377 0.5310021
## 0.4 1 0.8 0.50 150 0.7607659 0.5202615
## 0.4 1 0.8 0.75 50 0.7717188 0.5427550
## 0.4 1 0.8 0.75 100 0.7646752 0.5284397
## 0.4 1 0.8 0.75 150 0.7677972 0.5346885
## 0.4 1 0.8 1.00 50 0.7732751 0.5455307
## 0.4 1 0.8 1.00 100 0.7725000 0.5438932
## 0.4 1 0.8 1.00 150 0.7724847 0.5434917
## 0.4 2 0.6 0.50 50 0.7560876 0.5096464
## 0.4 2 0.6 0.50 100 0.7553033 0.5081089
## 0.4 2 0.6 0.50 150 0.7725092 0.5430553
## 0.4 2 0.6 0.75 50 0.7826379 0.5643799
## 0.4 2 0.6 0.75 100 0.7740594 0.5464358
## 0.4 2 0.6 0.75 150 0.7756219 0.5495861
## 0.4 2 0.6 1.00 50 0.7842249 0.5672661
## 0.4 2 0.6 1.00 100 0.7740625 0.5471675
## 0.4 2 0.6 1.00 150 0.7724939 0.5433267
## 0.4 2 0.8 0.50 50 0.7544975 0.5080365
## 0.4 2 0.8 0.50 100 0.7584069 0.5157859
## 0.4 2 0.8 0.50 150 0.7545067 0.5071782
## 0.4 2 0.8 0.75 50 0.7701471 0.5386518
## 0.4 2 0.8 0.75 100 0.7771752 0.5525136
## 0.4 2 0.8 0.75 150 0.7701409 0.5387826
## 0.4 2 0.8 1.00 50 0.7709375 0.5402691
## 0.4 2 0.8 1.00 100 0.7850031 0.5688808
## 0.4 2 0.8 1.00 150 0.7881219 0.5746234
## 0.4 3 0.6 0.50 50 0.7607659 0.5198452
## 0.4 3 0.6 0.50 100 0.7740441 0.5466808
## 0.4 3 0.6 0.50 150 0.7732721 0.5446326
## 0.4 3 0.6 0.75 50 0.7795527 0.5580647
## 0.4 3 0.6 0.75 100 0.7771875 0.5523954
## 0.4 3 0.6 0.75 150 0.7709375 0.5397292
## 0.4 3 0.6 1.00 50 0.7803064 0.5596854
## 0.4 3 0.6 1.00 100 0.7865748 0.5720744
## 0.4 3 0.6 1.00 150 0.7850031 0.5688987
## 0.4 3 0.8 0.50 50 0.7678186 0.5337114
## 0.4 3 0.8 0.50 100 0.7865625 0.5709686
## 0.4 3 0.8 0.50 150 0.7764154 0.5504186
## 0.4 3 0.8 0.75 50 0.7756189 0.5496248
## 0.4 3 0.8 0.75 100 0.7881066 0.5736735
## 0.4 3 0.8 0.75 150 0.7849969 0.5675944
## 0.4 3 0.8 1.00 50 0.7818781 0.5624707
## 0.4 3 0.8 1.00 100 0.7850245 0.5688436
## 0.4 3 0.8 1.00 150 0.7834528 0.5649894
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
## = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
## = 1.
auc(roc_xgb)
## Area under the curve: 0.9976
#GLM
pred1 <- predict(modelo_glm2, train_glm, type="response")
pred.glm <- prediction(pred1, train_glm$nota_vino)
perf.glm <- performance(pred.glm,"tpr", "fpr")
plot(perf.glm, col = "blue", lwd = 2)
#KNN
pred2 <- predict(caret.knn, newdata = train_knn, type = "prob")
pred22 <- ifelse(pred2$`1` > 0.5, pred2$`1`, 1-pred2$`0`)
pred.knn <- prediction(pred22, train_knn$nota_vino)
perf.knn <- performance(pred.knn,"tpr", "fpr")
plot(perf.knn, add = TRUE, col = "green", lwd = 2)
#DECISION TREE
pred3 <- predict(caret.tree, newdata = train_tree, type = "prob")
pred33 <- ifelse(pred3$`1` > 0.5, pred3$`1`, 1-pred3$`0`)
pred.tree <- prediction(pred33, train_tree$nota_vino)
perf.tree <- performance(pred.tree,"tpr", "fpr")
plot(perf.tree, add = TRUE, col = "yellow", lwd = 2)
#RANDOM FOREST
pred4 <- predict(caret.rf, newdata = train_forest, type = "prob")
pred44 <- ifelse(pred4$`1` > 0.5, pred4$`1`, 1-pred4$`0`)
pred.rf <- prediction(pred44, train_forest$nota_vino)
perf.rf <- performance(pred.rf,"tpr", "fpr")
plot(perf.rf, add = TRUE, col = "red", lwd = 2)
#ADA BOOST
pred5 <- predict(caret.ada, newdata = train_en, type = "prob")
pred55 <- ifelse(pred5$`1` > 0.5, pred5$`1`, 1-pred5$`0`)
pred.ada <- prediction(pred55, train_en$nota_vino)
perf.ada <- performance(pred.ada,"tpr", "fpr")
plot(perf.ada, add = TRUE, col = "orange", lwd = 2)
#XG BOOST
pred6 <- predict(caret.xgb, newdata = train_xgb, type = "prob")
pred66 <- ifelse(pred6$`1` > 0.5, pred6$`1`, 1-pred6$`0`)
pred.xgb <- prediction(pred66, train_xgb$nota_vino)
perf.xgb <- performance(pred.xgb,"tpr", "fpr")
plot(perf.xgb, add = TRUE, col = "purple", lwd = 2)
Realizamos los cambios y modificaciones necesarias sobre el conjunto de datos de test, aplicados previamente sobre nuestro dataset de train.
Imputamos los NAs del data set de test al valor de la mediana de la variable de referencia:
test$pH[is.na(test$pH)] <- median(test$pH, na.rm = TRUE)
test$sulphates[is.na(test$sulphates)] <- median(test$sulphates,
na.rm = TRUE)
summary(test)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.70 Min. :0.1600 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.080 1st Qu.: 1.900
## Median : 7.80 Median :0.5200 Median :0.250 Median : 2.150
## Mean : 8.17 Mean :0.5341 Mean :0.262 Mean : 2.486
## 3rd Qu.: 9.00 3rd Qu.:0.6600 3rd Qu.:0.420 3rd Qu.: 2.525
## Max. :15.00 Max. :1.2400 Max. :1.000 Max. :13.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.01200 Min. : 3.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.06800 1st Qu.: 7.00 1st Qu.: 20.00 1st Qu.:0.9957
## Median :0.07800 Median :14.00 Median : 37.00 Median :0.9967
## Mean :0.08452 Mean :15.95 Mean : 46.58 Mean :0.9967
## 3rd Qu.:0.08725 3rd Qu.:22.00 3rd Qu.: 63.25 3rd Qu.:0.9977
## Max. :0.61000 Max. :72.00 Max. :160.00 Max. :1.0024
## alcohol quality pH sulphates
## Min. : 8.80 Min. :3.000 Min. :2.740 Min. :0.3900
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:3.210 1st Qu.:0.5500
## Median :10.10 Median :6.000 Median :3.320 Median :0.6200
## Mean :10.39 Mean :5.641 Mean :3.315 Mean :0.6583
## 3rd Qu.:11.00 3rd Qu.:6.000 3rd Qu.:3.400 3rd Qu.:0.7100
## Max. :14.00 Max. :8.000 Max. :3.850 Max. :2.0000
Transformamos a logarítmicas las variables previamente normalizadas:
test <- test %>%
mutate(Log_residual_sugar = log(residual_sugar), Log_chlorides = log(chlorides),
Log_free_sulfur_dioxide = log(free_sulfur_dioxide), Log_total_sulfur_dioxide = log(total_sulfur_dioxide),
Log_sulphates = log(sulphates))
Modificamos nuestro dataset de test para que las variables transformadas a logaritmos sustituyan a las mismas pero aún sin transformar. Tendremos de ese modo un dataset con 12 variables también, pero 5 de ellas transformadas a logaritmos.
test <- test %>%
dplyr::select(-residual_sugar, -chlorides, -free_sulfur_dioxide,
-total_sulfur_dioxide, -sulphates)
head(test)
## # A tibble: 6 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol quality pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 0.998 9.4 5 3.51
## 2 7.3 0.65 0 0.995 10 7 3.39
## 3 8.9 0.22 0.48 0.997 9.4 6 3.39
## 4 7.6 0.41 0.24 0.996 9.5 5 3.28
## 5 7.1 0.71 0 0.997 9.4 5 3.47
## 6 5.7 1.13 0.09 0.994 9.8 4 3.5
## # … with 5 more variables: Log_residual_sugar <dbl>, Log_chlorides <dbl>,
## # Log_free_sulfur_dioxide <dbl>, Log_total_sulfur_dioxide <dbl>,
## # Log_sulphates <dbl>
Comprobación del modelo de Random Forest con los datos de test:
Pasamos a validar la capacidad predictora de nuestro modelo de Random Forest con el conjunto de datos de test. Para ello lo primero de todo, creamos de nuevo la variable binaria “nota_vino” sobre nuestro conjunto de datos en test.
test_forest <- test %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
test_forest
## # A tibble: 320 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 0.998 9.4 3.51
## 2 7.3 0.65 0 0.995 10 3.39
## 3 8.9 0.22 0.48 0.997 9.4 3.39
## 4 7.6 0.41 0.24 0.996 9.5 3.28
## 5 7.1 0.71 0 0.997 9.4 3.47
## 6 5.7 1.13 0.09 0.994 9.8 3.5
## 7 7.3 0.45 0.36 0.998 10.5 3.33
## 8 8.1 0.66 0.22 0.997 10.3 3.3
## 9 6.8 0.67 0.02 0.996 9.5 3.48
## 10 5.6 0.31 0.37 0.995 9.2 3.32
## # … with 310 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(test_forest$nota_vino)
##
## 0 1
## 147 173
str(test_forest)
## tibble [320 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
## $ volatile_acidity : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
## $ citric_acid : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
## $ density : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
## $ alcohol : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
## $ pH : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
## $ Log_residual_sugar : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
## $ Log_chlorides : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
## $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
## $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
## $ Log_sulphates : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
## $ nota_vino : num [1:320] 0 1 1 0 0 0 0 0 0 0 ...
test_forest$y_pred_probs2 <- predict(caret.rf, newdata = test_forest,
type = "prob")
test_forest$y_pred_probs2 <- ifelse(test_forest$y_pred_probs2$`1` >
0.5, test_forest$y_pred_probs2$`1`, 1 - test_forest$y_pred_probs2$`0`)
test_forest$y_pred2 <- ifelse(test_forest$y_pred_probs2 > 0.5,
1, 0)
# test_forest$y_pred_probs2 test_forest$y_pred2 test_forest
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de Random Forest obtenido:
cm_test_forest <- confusionMatrix(as.factor(test_forest$y_pred2),
as.factor(test_forest$nota_vino), positive = "1")
cm_test_forest$table
## Reference
## Prediction 0 1
## 0 113 56
## 1 34 117
# result
cm_test_forest$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.72
cm_test_forest$byClass["Recall"] %>%
round(2)
## Recall
## 0.68
cm_test_forest$byClass["Precision"] %>%
round(2)
## Precision
## 0.77
Comprobamos el área bajo la curva para el modelo de Random Forest en el conjunto de datos de test:
roc_rf_test <- plot.roc(as.numeric(test_forest$nota_vino), as.numeric(test_forest$y_pred_probs2),
col = "red")
auc(roc_rf_test)
## Area under the curve: 0.8095
Comprobación del modelo de XG Boost con los datos de test:
Pasamos a validar la capacidad predictora de nuestro modelo de XG Boost con el conjunto de datos de test. Para ello lo primero de todo, creamos de nuevo la variable binaria “nota_vino” sobre nuestro conjunto de datos en test.
test_xgb <- test %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
test_xgb
## # A tibble: 320 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 0.998 9.4 3.51
## 2 7.3 0.65 0 0.995 10 3.39
## 3 8.9 0.22 0.48 0.997 9.4 3.39
## 4 7.6 0.41 0.24 0.996 9.5 3.28
## 5 7.1 0.71 0 0.997 9.4 3.47
## 6 5.7 1.13 0.09 0.994 9.8 3.5
## 7 7.3 0.45 0.36 0.998 10.5 3.33
## 8 8.1 0.66 0.22 0.997 10.3 3.3
## 9 6.8 0.67 0.02 0.996 9.5 3.48
## 10 5.6 0.31 0.37 0.995 9.2 3.32
## # … with 310 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(test_xgb$nota_vino)
##
## 0 1
## 147 173
str(test_xgb)
## tibble [320 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
## $ volatile_acidity : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
## $ citric_acid : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
## $ density : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
## $ alcohol : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
## $ pH : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
## $ Log_residual_sugar : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
## $ Log_chlorides : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
## $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
## $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
## $ Log_sulphates : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
## $ nota_vino : num [1:320] 0 1 1 0 0 0 0 0 0 0 ...
test_xgb$y_pred_probs2 <- predict(caret.xgb, newdata = test_xgb,
type = "prob")
test_xgb$y_pred_probs2 <- ifelse(test_xgb$y_pred_probs2$`1` >
0.5, test_xgb$y_pred_probs2$`1`, 1 - test_xgb$y_pred_probs2$`0`)
test_xgb$y_pred2 <- ifelse(test_xgb$y_pred_probs2 > 0.5, 1, 0)
# test_xgb$y_pred_probs2 test_xgb$y_pred2 test_xgb
Reproducimos la matriz de confusión y las métricas de evaluación sobre el modelo final de XG Boost obtenido:
cm_test_xgb <- confusionMatrix(as.factor(test_xgb$y_pred2), as.factor(test_xgb$nota_vino),
positive = "1")
cm_test_xgb$table
## Reference
## Prediction 0 1
## 0 117 48
## 1 30 125
# result
cm_test_xgb$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.76
cm_test_xgb$byClass["Recall"] %>%
round(2)
## Recall
## 0.72
cm_test_xgb$byClass["Precision"] %>%
round(2)
## Precision
## 0.81
Comprobamos el área bajo la curva para el modelo de XG Boost en el conjunto de datos de test:
roc_xgb_test <- plot.roc(as.numeric(test_xgb$nota_vino), as.numeric(test_xgb$y_pred_probs2),
col = "purple")
auc(roc_xgb_test)
## Area under the curve: 0.8251
Regresión Lineal Mútiple: Vemos una falta de adecuación y ajuste del modelo de regresión lineal múltiple obtenido. Se observa un modelo con unos residuos que presentan heterocedasticidad (varianza no constante en el modelo - se viola la homocedasticidad) y que además no predice de forma adecuada la variable respuesta o dependiente, en base a las variables explicativas o independientes. Al tener una variable dependiente como “quality” que es discreta, un modelo de regresión linela normal no se ajusta a nuestros datos.
Reducción de la Dimensionalidad (PCA y t-SNE):
Regresión Logística (GLM):
KNN:
Decision Tree:
Métodos de Ensamble:
Random Forest:
SVM:
Ajuste de Hiperparámetros (Tuning):
Validación cruzada (Cross Validation):
Clustering (K-means):
GAM: